Foreword



The papers contained in this volume were presented at the 11th Annual
Symposium on Combinatorial Pattern Matching, held June 21-23, 2000 at the
Université de Montréal. They were selected from 44 abstracts submitted in
response to the call for papers. In addition, there were invited lectures by
Andrei Broder (AltaVista), Fernando Pereira (AT&T Research Labs), and Ian H.
Witten (University of Waikato).

The symposium was preceded by a two-day summer school set up to attract
and train young researchers. The lecturers at the school were Greg Butler,
Clement Lam, and Gus Grahne: BLAST! How do you search sequence databases?,
David Bryant: Phylogeny, Raffaele Giancarlo: Algorithmic aspects of speech
recognition, Nadia El-Mabrouk: Genome rearrangement, Laxmi Parida:
Flexible-pattern discovery, and Ian H. Witten: Adaptive text mining:
inferring structure from sequences.

Combinatorial Pattern Matching (CPM) addresses issues of searching and
matching strings and more complicated patterns such as trees, regular
expressions, graphs, point sets, and arrays. The goal is to derive non-trivial
combinatorial properties of such structures and to exploit these properties in
order to achieve superior performance for the corresponding computational
problems.

Over recent years a steady flow of high-quality research on this subject has 
changed a sparse set of isolated results into a fully-fledged area of algorithmics. 
This area is continuing to grow even further due to the increasing demand for 
speed and efficiency that comes from important and rapidly expanding appli- 
cations such as the World Wide Web, computational biology, and multimedia 
systems, involving requirements for information retrieval, data compression, and 
pattern recognition. The objective of the annual CPM gatherings is to provide an
international forum for research in combinatorial pattern matching and related 
applications. 

The first ten meetings were held in Paris (1990), London (1991), Tucson 
(1992), Padova (1993), Asilomar (1994), Helsinki (1995), Laguna Beach (1996), 
Aarhus (1997), Piscataway (1998), and Warwick (1999). After the first meeting, a
selection of papers appeared as a special issue of Theoretical Computer Science in 
Volume 92. The proceedings of the third to tenth meetings appeared as volumes 
644, 684, 807, 937, 1075, 1264, 1448, and 1645 of the Springer LNCS series.

The general organization and orientation of CPM conferences is coordinated
by a steering committee composed of: 



Alberto Apostolico, University of Padova & Purdue University
Maxime Crochemore, Université de Marne-la-Vallée
Zvi Galil, Columbia University
Udi Manber, Yahoo! Inc.

The program committee of CPM 2000 consisted of:

Amihood Amir, Bar Ilan University
Bonnie Berger, MIT
Byron Dom, IBM Almaden
Raffaele Giancarlo, Co-chair, University of Palermo
Dan Gusfield, University of California, Davis
Monika Henzinger, Google, Inc.
John Kececioglu, University of Georgia
Gad Landau, University of Haifa & Polytechnic University
Wojciech Rytter, University of Warsaw & University of Liverpool
Marie-France Sagot, Institut Pasteur
Cenk Sahinalp, Case Western Reserve University
David Sankoff, Co-chair, Université de Montréal
Jim Storer, Brandeis University
Esko Ukkonen, University of Helsinki

The local organizing committee, all from the Université de Montréal,
consisted of:

Nadia El-Mabrouk
Louis Pelletier
David Sankoff
Sylvain Viart


The conference was supported by the Centre de recherches mathématiques
of the Université de Montréal, in the context of a thematic year on
Mathematical Methods in Biology and Medicine (2000-2001).

Identifying and Filtering Near-Duplicate
Documents 



Andrei Z. Broder* 

AltaVista Company, San Mateo, CA 94402, USA 
andrei.broder@av.com



Abstract. The mathematical concept of document resemblance cap- 
tures well the informal notion of syntactic similarity. The resemblance 
can be estimated using a fixed size “sketch” for each document. For a 
large collection of documents (say hundreds of millions) the size of this 
sketch is of the order of a few hundred bytes per document. 

However, for efficient large scale web indexing it is not necessary to de- 
termine the actual resemblance value: it suffices to determine whether 
newly encountered documents are duplicates or near-duplicates of docu- 
ments already indexed. In other words, it suffices to determine whether 
the resemblance is above a certain threshold. In this talk we show how 
this determination can be made using a “sample” of less than 50 bytes
per document. 

The basic approach for computing resemblance has two aspects: first, 
resemblance is expressed as a set (of strings) intersection problem, and 
second, the relative size of intersections is evaluated by a process of 
random sampling that can be done independently for each document. 
The process of estimating the relative size of intersection of sets and the 
threshold test discussed above can be applied to arbitrary sets, and thus 
might be of independent interest. 

The algorithm for filtering near-duplicate documents discussed here has 
been successfully implemented and has been used for the last three years 
in the context of the AltaVista search engine. 



1 Introduction 

A Communist era joke in Russia goes like this: Leonid Brezhnev (the Party 
leader) wanted to get rid of the Premier, Aleksey Kosygin. (In fact he did, in 
1980.) So Brezhnev went to Kosygin and said: “My dear friend and war comrade
Aleksey, I had very disturbing news: I just found out that you are Jewish: I have 
no choice, I must ask you to resign.” Kosygin, in total shock says: “But Leonid, 
as you know very well I am not Jewish!”; then Brezhnev says: “Well, Aleksey, 
then think about it...” 

* Most of this work was done while the author was at Compaq’s System Research 
Center in Palo Alto. A preliminary version of this work was presented (but not 
published) at the “Fun with Algorithms” conference, Isola d’Elba, 1998. 

R. Giancarlo and D. Sankoff (Eds.): CPM 2000, LNCS 1848, pp. 1-10, 2000.

© Springer-Verlag Berlin Heidelberg 2000
What does this have to do with near-duplicate documents? In mid 1995, the
AltaVista web search engine was built at the Digital research labs in Palo
Alto. Soon after the first internal prototype was deployed, a colleague,
Chuck Thacker, came to me and said: “I really like AltaVista, but it is very
annoying that often half the first page of answers is just the same document
in many variants.” “I know,” said I. “Well,” said Chuck, “you did a lot of
work on fingerprinting documents; can you make up a fingerprinting scheme such
that two documents that are near-duplicate get the same fingerprint?” I was of
course indignant: “No way!! You miss the idea of fingerprints completely:
fingerprints are such that with high probability two distinct documents will
have different fingerprints, no matter how little they differ! Similar
documents getting the same fingerprint is entirely against their purpose.” So,
of course, Chuck said: “Well, then think about it...” ...and as usual, Chuck
was right.

Eventually I found a solution to this problem, based on a mathematical
notion called resemblance. Surprisingly, fingerprints play an essential role
in it.

The resemblance measures whether two (web) documents are roughly the 
same, that is, they have the same content except for modifications such as for- 
matting, minor corrections, capitalization, web-master signature, logo, etc. The 
resemblance is a number between 0 and 1, defined precisely below, such that 
when the resemblance is close to 1 it is likely that the documents are roughly 
the same. To compute the resemblance of two documents it suffices to keep for 
each document a “sketch” of a few (three to eight) hundred bytes consisting 
of a collection of fingerprints of “shingles” (contiguous subsequences of words, 
sometimes called “q-grams”). The sketches can be computed fairly fast (linear
in the size of the documents) and given two sketches the resemblance of the
corresponding documents can be computed in linear time in the size of the
sketches. Furthermore, clustering a collection of m documents into sets of
closely resembling documents can be done in time proportional to m log m
rather than m^2.

The first use of this idea was in a joint work with Steve Glassman, Mark
Manasse, and Geoffrey Zweig to cluster a collection of over 30,000,000
documents into sets of closely resembling documents (above 50% resemblance).
The documents were retrieved from a month-long “full” crawl of the World Wide
Web performed by AltaVista in April 1996. (It is amusing to note that three
years later, by mid 1999, AltaVista was crawling well over 20 million pages
daily.)

Besides fingerprints, another essential ingredient in the computation of
resemblance is a pseudo-random permutation of a large set, typically the set
[0, ..., 2^64 - 1]. It turns out that to achieve the desired result, the
permutation must be drawn from a min-wise independent family of permutations.
The concept of min-wise independence is in the same vein as the well-known
concept of pair-wise independence, and has many interesting properties. Moses
Charikar, Alan Frieze, Michael Mitzenmacher, and I studied this concept in a
separate paper.

The World Wide Web continues to expand at a tremendous rate. It is estimated
that the number of pages doubles roughly every nine months to a year. Hence
the problem of eliminating duplicates and near-duplicates from the index is
extremely important. The fraction of the total WWW collection consisting of
duplicates and near-duplicates has been estimated at 30 to 45%. These
documents arise innocently (e.g., local copies of popular documents,
mirroring), maliciously (e.g., “spammers” and “robot traps”), and erroneously
(crawler or server mistakes). In any case they represent a serious problem for
indexing software for two main reasons: first, indexing of duplicates wastes
expensive resources and second, users are seldom interested in seeing
documents that are “roughly the same” in response to their queries.

However, when applying the sketch computation algorithm to the entire corpus
indexed by AltaVista, even the modest storage costs described above become
prohibitive. On the other hand, we are interested only in whether the
resemblance is above a very high threshold; the actual value of the
resemblance does not matter.

This paper describes how to apply further processing to the sketches men- 
tioned above to construct for each document a short vector of “features.” With 
high probability, two documents share more than a certain number of features 
if and only if their resemblance is very high. For instance, using 6 features of 8 
bytes, that is, 48 bytes/document, for a set of 200,000,000 documents: 

— The probability that two documents that have resemblance greater than 
97.5% do not share at least two features is less than 0.01. The probability 
that two documents that have resemblance greater than 99% do not share 
at least two features is less than 0.00022. 

— The probability that two documents that have resemblance less than 77% 
do share two or more features is less than 0.01. The probability that two
documents that have resemblance less than 50% share two or more features
is less than 0.6 x 10^-7.

Thus the feature based mechanism for near-duplicate detection has excellent 
filtering properties. The probability of acceptance for this example (that is,
two or more common features) as a function of resemblance is graphed in Figure 1 on
a linear scale and in Figure 2 on a logarithmic scale. 

2 Preliminaries 

We start by reviewing some concepts and algorithms described in more detail in
earlier work.

The basic approach for computing resemblance has two aspects: First, re- 
semblance is expressed as a set intersection problem, and second, the relative 
size of intersections is evaluated by a process of random sampling that can be 
done independently for each document. (The process of estimating the relative 
size of intersection of sets can be applied to arbitrary sets.) 

The reduction to a set intersection problem is done via a process called
shingling. Via shingling each document D gets an associated set S_D. This is
done as follows:
We view each document as a sequence of tokens. We can take tokens to be
letters, or words, or lines. We assume that we have a parser program that takes
an arbitrary document and reduces it to a canonical sequence of tokens. (Here
“canonical” means that any two documents that differ only in formatting or
other information that we choose to ignore, for instance punctuation, formatting
commands, capitalization, and so on, will be reduced to the same sequence.) So
from now on a document means a canonical sequence of tokens.

A contiguous subsequence of w tokens contained in D is called a shingle. 
A shingle of length q is also known as a q-gram, particularly when the tokens 
are alphabet letters. Given a document D we can associate to it its w-shingling 
defined as the set of all shingles of size w contained in D. So for instance the 
4-shingling of 

(a, rose, is, a, rose, is, a, rose) 

is the set 



{(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)} 

(It is possible to use alternative definitions, based on multisets.)
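
The definition of a w-shingling can be illustrated with a minimal Python sketch. It is not part of the original text: the function name `w_shingling` is hypothetical, and the tokens are assumed to be words that have already been reduced to canonical form.

```python
def w_shingling(tokens, w):
    """Return the set of all contiguous w-token shingles of a token sequence."""
    # Each shingle is stored as a tuple so that it can be a set element.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# The example document from the text, already tokenized into words.
doc = ["a", "rose", "is", "a", "rose", "is", "a", "rose"]
shingles = w_shingling(doc, 4)
for shingle in sorted(shingles):
    print(shingle)
```

Because the result is a set, the two repeated shingles of the example collapse, leaving the three shingles listed above.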

Rather than deal with shingles directly, it is more convenient to associate to 
each shingle a numeric uid (unique id). This is done by fingerprinting the shingle.
(Fingerprints are short tags for larger objects. They have the property that if two 
fingerprints are different then the corresponding objects are certainly different 
and there is only a small probability that two different objects have the same 
fingerprint. This probability is typically exponentially small in the length of the 
fingerprint.) 

For reasons explained elsewhere, it is particularly advantageous to use Rabin
fingerprints, which have a very fast software implementation. Rabin
fingerprints are based on polynomial arithmetic and can be constructed in any
length. It is important to choose the length of the fingerprints so that the
probability of collisions (two distinct shingles getting the same fingerprint)
is sufficiently low. (More about this below.) In practice 64-bit Rabin
fingerprints are sufficient.

Hence from now on we associate to each document D a set of numbers S_D
that is the result of fingerprinting the set of shingles in D. Note that the
size of S_D is about equal to the number of words in D and thus storing S_D
on-line for every document in a large collection is infeasible.

The resemblance r(A, B) of two documents, A and B, is defined as

$$ r(A, B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}. $$

Experiments seem to indicate that high resemblance (that is, close to 1) captures
well the informal notion of “near-duplicate” or “roughly the same”. (There are
analyses that relate the “q-gram distance” to the edit distance.)
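
As a rough illustration of the definition of resemblance, the following Python sketch fingerprints the 4-shingles of two short documents and computes r(A, B) exactly. It makes several assumptions that are not in the original: the names `shingle_ids` and `resemblance` are hypothetical, a 64-bit BLAKE2b digest stands in for a Rabin fingerprint, and two empty documents are given resemblance 1 by convention.

```python
import hashlib


def shingle_ids(tokens, w):
    """Map every w-shingle of a token list to a 64-bit numeric id.

    Assumption: a 64-bit BLAKE2b digest is used as a stand-in for a
    Rabin fingerprint; only the fingerprint length matches the text.
    """
    ids = set()
    for i in range(len(tokens) - w + 1):
        data = "\x00".join(tokens[i:i + w]).encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        ids.add(int.from_bytes(digest, "big"))
    return ids


def resemblance(s_a, s_b):
    """Exact resemblance |S_A n S_B| / |S_A u S_B| of two id sets."""
    union = s_a | s_b
    # Convention chosen for this sketch only: two empty sets resemble fully.
    return len(s_a & s_b) / len(union) if union else 1.0


a = "a rose is a rose is a rose".split()
b = "a rose is a rose is a flower".split()
r = resemblance(shingle_ids(a, 4), shingle_ids(b, 4))
print(r)
```

Here the first document has three distinct 4-shingles and the second has four, three of which are shared, so barring a fingerprint collision the value is 3/4.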

Our approach to determining syntactic similarity is related to the sampling
approach developed independently by Heintze, though there are differences in
detail and in the precise definition of the measures used. Related sampling
mechanisms for determining similarity were also developed by Manber and
within the Stanford SCAM project.

To compute the resemblance of two documents it suffices to keep for each 
document a relatively small, fixed size sketch. The sketches can be computed 
fairly fast (linear in the size of the documents) and given two sketches the re- 
semblance of the corresponding documents can be computed in linear time in 
the size of the sketches. 

This is done as follows. Assume that for all documents of interest
$S_D \subseteq \{0, \ldots, n-1\} \stackrel{\text{def}}{=} [n]$. (As noted, in
practice $n = 2^{64}$.) Let $\pi$ be chosen uniformly at random over $S_n$, the
set of permutations of $[n]$. Then

$$ \Pr\bigl(\min\{\pi(S_A)\} = \min\{\pi(S_B)\}\bigr) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} = r(A, B). \tag{1} $$



Proof. Since $\pi$ is chosen uniformly at random, for any set $X \subseteq [n]$
and any $x \in X$, we have

$$ \Pr\bigl(\min\{\pi(X)\} = \pi(x)\bigr) = \frac{1}{|X|}. \tag{2} $$

In other words all the elements of any fixed set $X$ have an equal chance to
become the minimum element of the image of $X$ under $\pi$.

Let $\alpha$ be the smallest image in $\pi(S_A \cup S_B)$. Then
$\min\{\pi(S_A)\} = \min\{\pi(S_B)\}$ if and only if $\alpha$ is the image of
an element in $S_A \cap S_B$. Hence

$$ \Pr\bigl(\min\{\pi(S_A)\} = \min\{\pi(S_B)\}\bigr) = \Pr\bigl(\pi^{-1}(\alpha) \in S_A \cap S_B\bigr) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} = r(A, B). $$



Hence, we can choose, once and for all, a set of t independent random
permutations $\pi_1, \ldots, \pi_t$. (For instance we can take t = 100.) For
each document D, we store a sketch, which is the list

$$ \bar{S}_A = \bigl(\min\{\pi_1(S_A)\}, \min\{\pi_2(S_A)\}, \ldots, \min\{\pi_t(S_A)\}\bigr). $$

Then we can readily estimate the resemblance of A and B by computing how
many corresponding elements in $\bar{S}_A$ and $\bar{S}_B$ are equal. (It has
been shown that in fact we can use a single random permutation, store the t
smallest elements of its image, and then merge-sort the sketches. However, for
the purposes of this paper independent permutations are necessary.)
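
The estimator can be tried out on a toy universe. The Python sketch below is only an illustration under stated assumptions: it uses explicit, truly random permutations of a small set [n] with n = 1000 (feasible only because n is tiny, unlike the 64-bit universe of the text), and the names `make_permutations`, `sketch`, and `estimate_resemblance` are hypothetical.

```python
import random


def make_permutations(n, t, seed=0):
    """Draw t independent uniformly random permutations of {0, ..., n-1}."""
    rng = random.Random(seed)
    perms = []
    for _ in range(t):
        p = list(range(n))
        rng.shuffle(p)  # p[x] is the image of x under this permutation
        perms.append(p)
    return perms


def sketch(s, perms):
    """Sketch of a set: the minimum image of the set under each permutation."""
    return [min(p[x] for x in s) for p in perms]


def estimate_resemblance(sk_a, sk_b):
    """Fraction of positions at which the two sketches agree."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)


n, t = 1000, 400
perms = make_permutations(n, t)
s_a = set(range(0, 600))    # toy "shingle id" sets with
s_b = set(range(300, 900))  # |intersection| = 300 and |union| = 900
estimate = estimate_resemblance(sketch(s_a, perms), sketch(s_b, perms))
print(estimate, "vs exact", 300 / 900)
```

With t = 400 the standard deviation of the estimate is roughly 0.024 for a true resemblance of 1/3, so the printed value should be close to, but not exactly, 0.333.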

In practice, we have to deal with the fact that it is impossible to choose and
represent $\pi$ uniformly at random in $S_n$ for large n. We are thus led to
consider smaller families of permutations that still satisfy the min-wise
independence condition given by equation (2), since min-wise independence is
necessary and sufficient for equation (1) to hold. This is further explored
elsewhere, where it is shown that random linear transformations are likely to
suffice in practice. An alternative implementation has also been described. We
will ignore this issue in this paper.
So far we have seen how to estimate the resemblance of a pair of documents.
For this purpose the shingle fingerprints can be quite short since collisions have 
only a modest influence on our estimate if we first apply a random permutation 
to the shingles and then fingerprint the minimum value. 

However, sketches allow us to group a collection of m documents into sets of
closely resembling documents in time proportional to m log m rather than m^2,
assuming that the clusters are well separated, which is the practical case.

We perform the clustering algorithm in four phases. In the first phase, we 
calculate a sketch for every document as explained. This step is linear in the 
total length of the documents. 

To simplify the exposition of the next three phases we’ll say temporarily 
that each sketch is composed of shingles, rather than images of the fingerprint 
of shingles under random permutations of [n] . 

In the second phase, we produce a list of all the shingles and the documents 
they appear in, sorted by shingle value. To do this, the sketch for each document 
is expanded into a list of (shingle value, document ID) pairs. We simply sort this 
list. This step takes time O(m log m) where m is the number of documents.

In the third phase, we generate a list of all the pairs of documents that share
any shingles, along with the number of shingles they have in common. To do
this, we take the file of sorted (shingle, ID) pairs and expand it into a list of
(ID, ID, count of common shingles) triplets by taking each shingle that appears
in multiple documents and generating the complete set of (ID, ID, 1) triplets
for that shingle. We then apply a merge-sort procedure (adding the counts for
matching ID-ID pairs) to produce a single file of all (ID, ID, count) triplets
sorted by the first document ID. This phase requires the greatest amount of disk
space because the initial expansion of the document ID triplets is quadratic in
the number of documents sharing a shingle, and initially produces many triplets
with a count of 1. Because of this fact we must choose the length of the shingle
fingerprints so that the number of collisions is small. To ensure this we can take
it to be, say, 2 log_2 m + 20 bits. In practice 64-bit fingerprints suffice.

In the final phase, we produce the complete clustering. We examine each 
(ID, ID, count) triplet and decide if the document pair exceeds our threshold 
for resemblance. If it does, we add a link between the two documents in a
union-find algorithm. The connected components output by the union-find algorithm
form the final clusters. 
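
The last three phases can be sketched in memory as follows. This is a simplified illustration, not the disk-based implementation described above: the function name `cluster` and the toy data are hypothetical, sketches are treated as plain sets of shingle values (the same simplification the exposition makes), and a dictionary of counts replaces the merge-sort of (ID, ID, 1) triplets.

```python
from collections import defaultdict
from itertools import combinations


def cluster(sketches, threshold):
    """Cluster documents whose sketches share at least `threshold` values.

    sketches: dict mapping a document id to a list of sketch values.
    """
    # Phase 2: all (shingle value, document id) pairs, sorted by value.
    pairs = sorted((v, d) for d, sk in sketches.items() for v in set(sk))

    # Phase 3: for each value, expand the documents sharing it into
    # (id, id) pairs and add up the number of shared values per pair.
    docs_by_value = defaultdict(list)
    for v, d in pairs:
        docs_by_value[v].append(d)
    counts = defaultdict(int)
    for docs in docs_by_value.values():
        for a, b in combinations(docs, 2):  # quadratic in len(docs)
            counts[(a, b)] += 1

    # Phase 4: link pairs above the threshold with union-find.
    parent = {d: d for d in sketches}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), c in counts.items():
        if c >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for d in sketches:
        groups[find(d)].add(d)
    return sorted(sorted(g) for g in groups.values())


toy = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 3, 9],
    "d3": [7, 8, 5, 6],
    "d4": [7, 8, 5, 0],
    "d5": [10, 11, 12, 13],
}
clusters = cluster(toy, threshold=3)
print(clusters)
```

The quadratic expansion inside phase 3 is the step the text identifies as the most expensive, which is why well-separated clusters and few fingerprint collisions matter.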

3 Filtering Near-Duplicates 

Consider two documents, A and B, that have resemblance $\rho$. If $\rho$ is
close to 1, then almost all the elements of $\bar{S}_A$ and $\bar{S}_B$ will be
pairwise equal. The idea of duplicate filtering is to divide every sketch into
k groups of s elements each. The probability that all the elements of a group
are pair-wise equal is simply $\rho^s$ and the probability that two sketches
have r or more equal groups is

$$ P_{k,s,r}(\rho) = \sum_{r \le i \le k} \binom{k}{i} \rho^{s \cdot i} \left(1 - \rho^s\right)^{k-i}. $$
The remarkable fact is that for suitable choices of [k, s, r] the polynomial
$P_{k,s,r}$ behaves as a very sharp high-band pass filter even for small values of k.
For instance Figure 1 graphs $P_{6,14,2}(x)$ on a linear scale and Figure 2 graphs it
on a logarithmic scale. The sharp drop-off is obvious.
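
The filter can also be evaluated numerically. The following Python sketch assumes the binomial form of the acceptance probability (r or more of k groups equal, each group being equal with probability x^s); the function name `accept_probability` is hypothetical.

```python
from math import comb


def accept_probability(x, k, s, r):
    """Probability that at least r of k groups of s sketch elements agree,
    for two documents of resemblance x."""
    q = x ** s  # probability that one whole group agrees
    return sum(comb(k, i) * q ** i * (1 - q) ** (k - i)
               for i in range(r, k + 1))


# The running example: k = 6 groups, s = 14 elements per group, r = 2.
for x in (0.50, 0.77, 0.90, 0.95, 0.99):
    print(f"{x:.2f}  {accept_probability(x, 6, 14, 2):.3e}")
```

Tabulating a few more points between 0.8 and 1.0 gives a quick textual substitute for the two figures.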

To use this fact, we first compute for each document D the sketch $\bar{S}_D$ as
before, using k · s independent permutations. (We can now be arbitrarily generous
with the length of the fingerprints used to create shingle uid’s; however 64 bits
are plenty for our situation.) We then split $\bar{S}_D$ into k groups of s elements and
fingerprint each group. (To avoid dependencies, we use a different irreducible
polynomial for these fingerprints.) We can also concatenate to each group a
group id number before fingerprinting.

Now all we need to store for each document is these k fingerprints, called
“features”. Because fingerprints could collide, the probability that two features
are equal is

$$ \rho^s + (1 - \rho^s)\, p_f, $$

where $p_f$ is the collision probability. This would indicate that it suffices to use
fingerprints long enough so that p_f is less than, say, 10^-6. However, when
applying the filtering mechanism to a large collection of documents, we again 
use the clustering process described above, and hence we must avoid spurious 
sharing of features. Nevertheless, for our problem 64-bit fingerprints are again
sufficient. 
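
A minimal Python sketch of the feature computation follows. It is an illustration under assumptions, not the implementation used at AltaVista: a personalized 64-bit BLAKE2b digest stands in for a Rabin fingerprint with a different irreducible polynomial, sketch values are assumed to be non-negative integers below 2^64, and the names `features` and `shared_features` are hypothetical.

```python
import hashlib


def features(sketch, k, s):
    """Split a sketch of k*s values into k groups and fingerprint each group.

    Each feature is 8 bytes, so k = 6 gives 48 bytes per document.
    """
    if len(sketch) != k * s:
        raise ValueError("sketch must contain exactly k*s values")
    feats = []
    for g in range(k):
        group = sketch[g * s:(g + 1) * s]
        # Prepend the group id, then the s sketch values as 8-byte integers.
        data = g.to_bytes(4, "big") + b"".join(v.to_bytes(8, "big") for v in group)
        # Assumption: the personalization string plays the role of using a
        # different fingerprinting function from the one used for shingles.
        feats.append(hashlib.blake2b(data, digest_size=8, person=b"feature").digest())
    return feats


def shared_features(f_a, f_b):
    """Number of positions at which two feature vectors agree."""
    return sum(x == y for x, y in zip(f_a, f_b))


k, s = 6, 14
sketch_a = list(range(k * s))  # toy sketch values
sketch_b = list(sketch_a)
sketch_b[0] = 999              # one differing element spoils group 0 only
f_a, f_b = features(sketch_a, k, s), features(sketch_b, k, s)
print(shared_features(f_a, f_b))
```

A single differing sketch element invalidates only the group that contains it, which is what makes the group size s the main lever on the sharpness of the filter.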

It is particularly convenient, if possible, to choose the threshold r to be 1 or 
2. If r = 2 then the third phase of the merging process becomes much simpler 
since we don’t need to keep track of how many features are shared by various 
pairs of documents: we simply keep a list of pairs known to share at least one 
feature. As soon as we discover that one of these pairs shares a second feature, 
we know that with high probability the two documents are near-duplicates, and 
thus one of them can be removed from further consideration. If r = 1 the third 
phase becomes moot. In general it is possible to avoid the third phase if we
again group every r features into a super-feature, but this forces the number of
features per document to become $\binom{k}{r}$.
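
The super-feature idea can be sketched as follows, assuming that features are 8-byte strings compared position by position. The name `super_features` is hypothetical and BLAKE2b again stands in for the fingerprinting function.

```python
import hashlib
from itertools import combinations


def super_features(feats, r):
    """Fingerprint every choice of r positions of a feature vector.

    A vector of k features yields C(k, r) super-features. Two documents
    agree on at least r features exactly when they have a super-feature
    in common (ignoring fingerprint collisions).
    """
    out = []
    for idx in combinations(range(len(feats)), r):
        # Include the positions so that only aligned features can match.
        data = b"".join(i.to_bytes(2, "big") + feats[i] for i in idx)
        out.append(hashlib.blake2b(data, digest_size=8).digest())
    return out


feats_a = [bytes([i]) * 8 for i in range(6)]           # toy 8-byte features
feats_b = feats_a[:2] + [bytes([100 + i]) * 8 for i in range(4)]
feats_c = feats_a[:1] + [bytes([200 + i]) * 8 for i in range(5)]

sf_a, sf_b, sf_c = (set(super_features(f, 2)) for f in (feats_a, feats_b, feats_c))
print(len(sf_a), len(sf_a & sf_b), len(sf_a & sf_c))
```

With k = 6 and r = 2 each document carries 15 super-features instead of 6 features, which is the storage cost the text refers to.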



4 Choosing the Parameters 



As is often the case in filter design, choosing the parameters is half science, half
black magic. It is useful to start from a target threshold resemblance $\rho_0$. Ideally

$$ P_{k,s,r}(\rho) = \begin{cases} 1, & \text{for } \rho \ge \rho_0; \\ 0, & \text{otherwise.} \end{cases} $$

Clearly, once s is chosen, r should be approximately $k \cdot \rho_0^s$ and the larger k (and
r) the sharper the filter. (Of course, we are restricted to integral values for k, s,
and r.)

If we make the (unrealistic) assumption that resemblance is uniformly
distributed between 0 and 1 within the set of pairs of documents to be checked,
then the total error is proportional to

$$ \int_0^{\rho_0} P_{k,s,r}(x)\,dx + \int_{\rho_0}^{1} \bigl(1 - P_{k,s,r}(x)\bigr)\,dx. $$



Differentiating with respect to $\rho_0$ we obtain that this is minimized when
$P_{k,s,r}(\rho_0) = 1/2$. To continue with our example we have
$P_{6,14,2}(x) = 1/2$ for x = 0.909... .

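A different approach is to choose s so that the slope of $x^s$ at $x = \rho_0$ is
maximized. This happens when

$$ \frac{\partial}{\partial s}\left(s\,\rho_0^{\,s-1}\right) = 0, $$

or $s = 1/\ln(1/\rho_0)$. For s = 14 the value that satisfies this condition is
$\rho_0 = 0.931\ldots$ .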

In practice these ideas give only a starting point for the search for a filter 
that provides the required trade-offs between error bounds, time, and space. It 
is necessary to graph the filter and do experimental determinations. 
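
Both starting points are easy to compute. The Python sketch below assumes the binomial form of the acceptance probability and uses hypothetical names (`accept_probability`, `half_point`, `steepest_rho`); bisection is one convenient way to solve for the half point, not necessarily the method used for the values quoted above.

```python
from math import comb, exp


def accept_probability(x, k, s, r):
    """Probability that at least r of k groups of s elements agree."""
    q = x ** s
    return sum(comb(k, i) * q ** i * (1 - q) ** (k - i)
               for i in range(r, k + 1))


def half_point(k, s, r, tol=1e-12):
    """Resemblance at which the filter accepts with probability 1/2.

    Bisection is valid because the acceptance probability is increasing
    in the resemblance for r >= 1.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if accept_probability(mid, k, s, r) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def steepest_rho(s):
    """Threshold for which s = 1 / ln(1 / rho_0), i.e. rho_0 = exp(-1/s)."""
    return exp(-1.0 / s)


x_half = half_point(6, 14, 2)
rho_steep = steepest_rho(14)
print(x_half, rho_steep)
```

The two criteria give different thresholds for the same s = 14, which is consistent with the remark that they are only starting points for an experimental search.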

5 Conclusion 

We have presented a method that can eliminate near-duplicate documents from 
a collection of hundreds of millions of documents by computing independently 
for each document a vector of features less than 50 bytes long and comparing 
only these vectors rather than entire documents. The entire processing takes 
time O(m log m) where m is the size of the collection. The algorithm described
here has been successfully implemented and is in current use in the context of 
the AltaVista search engine. 
Abstract. Much of computational linguistics in the past thirty years
assumed a ready supply of general and linguistic knowledge, and limit- 
less computational resources to use it in understanding and producing 
language. However, accurate knowledge is hard to acquire and compu- 
tational power is limited. Over the last ten years, inspired in part by 
advances in speech recognition, computational linguists have been in- 
vestigating alternative approaches that take advantage of the statistical 
regularities in large text collections to automatically acquire efficient ap- 
proximate language processing algorithms. Such machine-learning tech- 
niques have achieved remarkable successes in tasks such as document 
classification, part-of-speech tagging, named-entity recognition and
classification, and even parsing and machine translation.
Abstract. What will it be like to work in tomorrow’s digital library? We
begin by browsing around an experimental digital library of the present, 
glancing at some collections and showing how they are organized. Then 
we look to the future. Although present digital libraries are quite like 
conventional libraries, we argue that future ones will feel qualitatively 
different. Readers — and writers — will work in the library using a kind of 
context-directed browsing. This will be supported by structures derived 
from automatic analysis of the contents of the library — not just the cat- 
alog, or abstracts, but the full text of the books and journals — using new 
techniques of text mining. 



1 Introduction 

Over sixty years ago, science fiction writer H.G. Wells was promoting the
concept of a “world brain” based on a permanent world encyclopedia which “would
be the mental background of every intelligent [person] in the world. It would
be alive and growing and changing continually under revision, extension and
replacement from the original thinkers in the world everywhere. ... even
journalists would deign to use it”. Eight years later, Vannevar Bush, the
highest-ranking scientific administrator in the U.S. war effort, invited us to
“consider a future device for individual use, which is a sort of mechanized
private file and library ... a device in which an individual stores all his
books, records, and communications, and which is mechanized so that it may be
consulted with exceeding speed and flexibility”. Fifteen years later, J.C.R.
Licklider, head of the U.S. Department of Defense’s Information Processing
Techniques Office, envisioned that human brains and computing machines would
be coupled together very tightly, and imagined this to be supported by a
“network of ‘thinking centers’ that will incorporate the functions of
present-day libraries together with anticipated advances in information
storage and retrieval”. Thirty-five years later we became accustomed to
hearing similar pronouncements from the U.S. Presidential office.

Digital libraries, conceived by visionary thinkers and fertilized with resources 
by today’s politicians, are undergoing a protracted labor and birth. Libraries are 
society’s repositories for knowledge, and digital libraries are of the utmost strate- 
gic importance in a knowledge-based economy. Not surprisingly, many countries 
have initiated large-scale digital library projects. Some years ago the DLI
initiative was set up in the U.S. (and has now entered a second phase); in the
U.K. the Elib program was set up at about the same time; other countries in
Europe and the Pacific Rim have followed suit. Digital libraries will likely
figure amongst the most important and influential institutions of the 21st
Century.

But what is a digital library? Ten definitions of the term have been culled
from the literature by Fox, and their spirit is captured in the following brief
characterization:

A focused collection of digital objects, including text, video, and audio,
along with methods for access and retrieval, and for selection,
organization, and maintenance of the collection.

This definition gives equal weight to user (access and retrieval) and librarian 
(selection, organization and maintenance). Other definitions in the literature, 
emanating mostly from technologists, omit — or at best downplay — the librar- 
ian’s role, which is unfortunate because it is the selection, organization, and 
maintenance that will distinguish digital libraries from the anarchic mess that 
we call the World Wide Web. However, digital libraries tend to blur what used 
to be a sharp distinction between user and librarian — because the ease of aug- 
menting, editing, annotating and re-organizing electronic collections means that 
they will support the development of new knowledge in situ. 

What’s it like to work in a digital library? Will it feel like a conventional 
library, but more computerized, more networked, more international, more all- 
encompassing, more convenient? I believe the answer is no: it will feel qualita- 
tively different. Not only will it be with you on your desktop (or at the beach, 
or in the plane), but information workers will work “inside” the library in a 
way that is quite unlike how they operate at present. It’s not just that knowl- 
edge and reference services will be fully portable, operating round the world, 
around the clock, throughout the year, freeing library patrons from geographic 
and temporal constraints — important and liberating as these are. It’s that when 
new knowledge is created it will be fully contextualized and both sited within 
and cited by existing literature right from its conception. 

In this paper, we browse around a digital library, looking at tools and tech- 
niques under development. “Browse” is used in a dual sense. First we begin by 
browsing a particular collection, and then look briefly at some others. Second, 
we examine the digital library’s ability to support novel browsing techniques. 
These situate browsing within the reader’s current context and unobtrusively 
guide them in ways that are relevant to what they are doing. Context-directed 
browsing is supported by structures derived from automatic analysis of the li- 
brary’s contents — not just the catalog, or abstracts, but the full text of the 
documents — using techniques that are being called “text mining.” Of course, 
other ways of finding information are important too — user searching, librarian 
recommendations, automatic notification, group collaboration — but here we fo- 
cus on browsing. The work described was undertaken by members of the New 
Zealand Digital Library project. 

Fig. 1. (a) Village Level Brickmaking, (b) The collection’s home page



2 The Humanity Development Library 



Figure 1a shows a book in the Humanity Development Library, a collection of
humanitarian information put together by the Global Help Project to address 
the needs of workers in developing countries (www.nzdl.org/hdl). This book
might have been reached by a directed full-text search, or by browsing one of 
a number of access structures, or by clicking on one of a gallery of images. On 
opening the book, which is entitled Village Level Brickmaking, a picture of its
cover appears at the top, beside a hierarchical table of contents. In the figure, 
the reader has drilled down into a chapter on moulding and a subsection on sand 
moulding, whose text appears below. Readers can expand the table of contents 
from the section to the whole book; and expand the text likewise (which is very 
useful for printing). The ever-present picture of the book’s cover gives a feeling 
of physical presence and a constant reminder of the context. 

Readers can browse the collection in several different ways, as determined by 
the editor who created it. Figure 1b shows the collection’s home page, at the top
of which (underneath the logo) is a bar of five buttons that open up different 
access mechanisms. A subject hierarchy provides a tree-structured classification 
scheme for the books. Book titles appear in an alphabetical index. A separate 
list gives participating organizations and the material that they contributed. A 
“how-to” list of helpful hints, created by the collection’s editor, allows a par- 
ticular book to be accessed from brief phrases that describe the problems the 
book addresses. However a book is reached, it appears in the standard form
illustrated in Figure 1a, along with the cover picture to give a sense of presence.
The different access mechanisms help solve the librarian’s dilemma of where to 
shelve books P|: each one appears on many different virtual shelves, shelves that
are organized in different ways. 

Full-text search of titles and entire documents provides important additional
access mechanisms. The search engine that we use, MG uni, supports search- 
ing over the full text of the document — not merely a document surrogate as 
in conventional digital library retrieval systems. User feedback from an earlier 
version of this collection indicated that Boolean searching was more confusing 
than helpful for the targeted users. Previous research suggests that difficulties 
with Boolean syntax and semantics are widespread, and transaction log anal- 
ysis of several library retrieval systems indicates that by far the most popular 
Boolean operator is AND; the others are rarely used. For all these reasons, the 
interface default for this collection is ranked queries. However, to enable users to 
construct high-precision conjunctive searches where necessary, selecting “search 
... for all the words” in the query dialog produces the syntax-free equivalent of
a conjunctive query. 
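
The difference between the two query modes can be illustrated with a small sketch. This is not the MG implementation: the function names are hypothetical, and a simple count of matching query words stands in for whatever similarity measure a real ranked-retrieval engine would use.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def all_words(index, query):
    """Conjunctive ("all the words") search: documents containing every query word."""
    postings = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()

def ranked(index, query):
    """Ranked search: order documents by how many distinct query words they contain.
    (Assumption: a real engine would weight terms rather than just count them.)"""
    scores = defaultdict(int)
    for w in set(query.lower().split()):
        for doc_id in index.get(w, set()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))
```

A ranked query returns every document that matches at least one word, best matches first, so a user never has to think about operators; the conjunctive form trades recall for precision.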

Just as libraries display new acquisitions or special collections in the foyer 
to pique the reader’s interest, this collection’s home page (Figure 1b) highlights
a particular book that changes every few seconds: it can be opened by clicking 
on the image. This simple display is extraordinarily compelling. And just as 
libraries may display a special book in a glass case, open at a different page 
each day, a “gallery” screen can show an ever-changing mosaic of images from 
pages of the books, remarkably informative images that, when clicked, open the 
book to that page. Or a scrolling “Times Square” display of randomly selected 
phrases that, when clicked, take you to the appropriate book. The possibilities 
are endless. 

The Humanity Development Library is a focused collection of 1250 books,
minuscule by library standards but nevertheless comprehensive within the
targeted domain. It contains 53,000 chapters, 62 million words, and 32,000
pictures. Although the text occupies 390 MB, it compresses to 102 MB and the
two indexes (for titles and chapters respectively) compress to less than 80 MB.
The images (mostly in PNG format) occupy 290 MB. Associated files bring the
total size of the collection to 505 MB. Even if there were twice as much text,
and the same images, it would still fit comfortably on a CD-ROM, along with
all the necessary software. A single DVD-ROM would hold a collection twenty
times the size, still small by library standards but immense for a fully portable
collection.

3 An Experimental Testbed: The NZDL 

The Humanity Development Library is just one of two dozen publicly-available 
collections produced by the New Zealand Digital Library (NZDL) project and 
listed on the project’s home page (www.nzdl.org), part of which is shown in 
Figure 2a, which illustrates the wide range of collections. This project aims
to develop the underlying infrastructure for digital libraries and provide
example collections that demonstrate how it can be used. The library is
international, and the Unicode character set is used throughout: there are
interfaces in English, Maori, French, German, Arabic, and Chinese, and
collections have been produced in all these languages. Digital libraries are
particularly empowering for the disabled, and there is a text-only version of
the interface intended for visually impaired users.

Fig. 2. (a) Some of the NZDL collections, (b) Reading a Chinese book

The editors of the Humanity Development Library have gone to great lengths 
to provide a rich set of access structures. However, this is a demanding, labor- 
intensive task, and most collections are not so well organized. The basic access 
tool in the NZDL is full-text searching, which is available for all collections and 
is provided completely automatically when a collection is built. Some collections 
allow, in addition, traditional catalog searching based on author, title, and key- 
words, and full-text search within abstracts. Our experience is that while the user 
interface is considerably enhanced when traditional library cataloging informa- 
tion is available, it is often prohibitively expensive to create formal cataloging 
information for electronically-gathered collections. With appropriate indexes, 
full-text retrieval can be used to approximate the services provided by a formal 
catalog. 

3.1 Collections 

The core of any library is the collections it contains. A few examples will illustrate 
the variety and scope of the services provided. 

The historically first collection, Computer Science Technical Reports, now
contains 46,000 reports — 1.3 million pages, half a billion words — extracted auto- 
matically from 34 GB of raw PostScript. There is no bibliographic or “metadata” 
information: we have only the contents of the reports (and the names of the FTP 
sites from which they were gathered). Many are Ph.D. theses which would
otherwise be effectively lost except to a minuscule community of cognoscenti: full-text
search reaches right inside the documents and makes them accessible to anyone 
looking for information on that topic. 

As well as the simplified searching interface for the Humanity Development 
Library described above, users can choose a more comprehensive query interface 
(via a Preferences page). Case-folding and stemming can be independently
enabled or disabled, and full Boolean query syntax is supported as well as ranked
queries. Moreover, in the Computer Science Technical Reports collection, searches can be
restricted to the first page of reports, which approximates an author/title search 
in the absence of specific bibliographic details of the documents. Although this 
is a practical solution, the collection nevertheless presents a raw, unpolished 
appearance compared with the Humanity Development Library, reflecting the 
difference between a carefully-edited set of documents, including hand-prepared 
classification indexes and other metadata, and a collection of information pulled 
mechanically off the Web and organized without any human intervention at all. 

There are several collections of books, including, for example, the English 
books entered by the Gutenberg project. Figure 2b shows a book in a collection
of classical Chinese literature. The full text was available on the Web; we auto- 
matically extracted the section headings to provide the table of contents visible 
at the upper right, and scanned the book’s cover to generate the cover image. 
One can perform full-text search on the complete contents or on section headings 
alone, using the Chinese language (of course your Web browser must be set up 
correctly to work in Chinese). There is also a browsable list of book titles. 

An expressly bilingual collection of Historic New Zealand Newspapers con- 
tains issues of forty newspapers published between 1842 and 1933 for a Maori 
audience. Collected on microfiche, these constitute 12,000 page images. Although 
they represent a significant resource for historians, linguists and social scien- 
tists, their riches remain largely untapped because of the difficulty of accessing, 
searching and browsing material in unindexed microfiche form. Figure 3 shows
the parallel English-Maori text retrieved from the newspaper Te Waka Maori of 
August 1878 in response to the query Rotorua, a small town in New Zealand. 
Searching is carried out on electronic text produced using OCR; once the target 
is identified, the corresponding page image can be displayed. 

3.2 The Greenstone Software 

All these collections are created using the Greenstone software developed by the 
NZDL project | l /j . Information collections built by Greenstone combine exten- 
sive full-text search facilities with browsing indexes based on different metadata 
types. There are several ways for users to find information, although they dif- 
fer between collections depending on the metadata available and the collection 
design. You can search for particular words that appear in the text, or within 
a section of a document, or within a title or section heading. You can browse 
documents by title: just click on the displayed book icon to read it. You can 
browse documents by subject. Subjects are represented by bookshelves: just click
on a shelf to see the books. Where appropriate, documents come complete with
a table of contents (constructed automatically): you can click on a chapter or
subsection to open it, expand the full table of contents, or expand the full
document.

Fig. 3. Searching the Historic New Zealand newspapers collection

A distinction is made between searching and browsing. Searching is full- 
text, and — depending on the collection’s design — the user can choose between 
indexes built from different parts of the documents, or from different metadata. 
Some collections have an index of full documents, an index of sections, an index 
of paragraphs, an index of titles, and an index of section headings, each of 
which can be searched for particular words or phrases. Browsing involves data 
structures created from metadata that the user can examine: lists of authors, 
lists of titles, lists of dates, hierarchical classification structures, and so on. Data 
structures for both browsing and searching are built according to instructions in 
a configuration file, which controls both building and serving the collection. 
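
As a rough illustration of indexes built at several granularities under the control of a configuration, consider the sketch below. The configuration keys, the document layout, and the function names are all assumptions made for this example; they do not reproduce Greenstone's actual configuration file format.

```python
from collections import defaultdict

# Hypothetical collection configuration: which granularities to index.
CONFIG = {"indexes": ["document", "section", "title"]}

def units(doc_id, doc, level):
    """Yield (unit_id, text) pairs for one document at the given level."""
    if level == "title":
        yield doc_id, doc["title"]
    elif level == "section":
        for i, (heading, body) in enumerate(doc["sections"]):
            yield f"{doc_id}.{i + 1}", f"{heading} {body}"
    elif level == "document":
        text = " ".join(f"{h} {b}" for h, b in doc["sections"])
        yield doc_id, f"{doc['title']} {text}"
    else:
        raise ValueError(f"unknown index level: {level}")

def build_indexes(docs, config):
    """Build one inverted index (word -> set of unit ids) per configured level."""
    indexes = {}
    for level in config["indexes"]:
        index = defaultdict(set)
        for doc_id, doc in docs.items():
            for unit_id, text in units(doc_id, doc, level):
                for word in text.lower().split():
                    index[word].add(unit_id)
        indexes[level] = index
    return indexes
```

Searching the "section" index returns section identifiers rather than whole documents, which is what lets an interface open a book directly at the matching section.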

Rich browsing facilities can be provided by manually linking parts of docu- 
ments together and building explicit indexes and tables of contents. However, 
manually-created linking becomes difficult to maintain, and often falls into disre- 
pair when a collection expands. The Greenstone software takes a different tack: it 
facilitates maintainability by creating all searching and browsing structures au- 
tomatically from the documents themselves. No links are inserted by hand. This 
means that when new documents in the same format become available, they can 
be added automatically. Indeed, for some collections this is done by processes
that wake up regularly, scout for new material, and rebuild the indexes, all
without manual intervention.

Collections comprise many documents: thousands, tens of thousands, or even 
millions. Each document may be hierarchically organized into sections (subsec- 
tions, sub-subsections, and so on). Each section comprises one or more para- 
graphs. Metadata such as author, title, date, keywords, and so on, may be as- 
sociated with documents, or with individual sections of documents. This is the 
raw material for indexes. It must either be provided explicitly for each document 
and section (for example, in an accompanying spreadsheet) or be derivable au- 
tomatically from the source documents. Metadata is converted to Dublin Core 
and stored with the document for internal use. 

In order to accommodate different kinds of source documents, the software 
is organized so that “plugins” can be written for new document types. Plug- 
ins exist for plain text documents, HTML documents, email documents, and 
bibliographic formats. Word documents are handled by saving them as HTML; 
PostScript ones by applying a preprocessor m- Specially written plugins also 
exist for proprietary formats such as that used by the BBC archives department. 
A collection may have source documents in different forms: it is just a matter 
of specifying all the necessary plugins. In order to build browsing indexes from 
metadata, an analogous scheme of “classifiers” is used: classifiers create indexes
of various kinds based on metadata. Source documents are brought into the 
Greenstone system through a process called importing, which uses the plugins 
and classifiers specified in the collection configuration file. 
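
The plugin and classifier arrangement can be sketched as follows. The class names, the `extensions` attribute, and the `import_collection` function are hypothetical and chosen only for illustration; the real Greenstone plugin interface is not described in this paper and is not reproduced here.

```python
import re

class TextPlugin:
    """Hypothetical plugin for plain text: the first line is taken as the title."""
    extensions = (".txt",)

    def parse(self, raw):
        lines = raw.strip().splitlines()
        return {"title": lines[0].strip(), "text": "\n".join(lines[1:]).strip()}

class HTMLPlugin:
    """Hypothetical plugin for HTML: title from <title>, tags stripped crudely."""
    extensions = (".html", ".htm")

    def parse(self, raw):
        match = re.search(r"<title>(.*?)</title>", raw, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else "Untitled"
        body = re.sub(r"<title>.*?</title>", " ", raw, flags=re.IGNORECASE | re.DOTALL)
        text = re.sub(r"<[^>]+>", " ", body)
        return {"title": title, "text": " ".join(text.split())}

def title_classifier(documents):
    """Hypothetical classifier: an alphabetical browsing list of titles."""
    return sorted(doc["title"] for doc in documents)

def import_collection(files, plugins, classifiers):
    """Convert each source file with the first plugin that claims its extension
    (files no plugin claims are skipped), then run every classifier."""
    documents = []
    for name, raw in files.items():
        for plugin in plugins:
            if name.lower().endswith(plugin.extensions):
                documents.append(plugin.parse(raw))
                break
    return documents, {c.__name__: c(documents) for c in classifiers}
```

Supporting a new source format then amounts to writing one more plugin class and listing it, without touching the indexing or serving code.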

The system includes an “administrative” function whereby specified users 
can examine the composition of all collections, protect documents so that they 
can only be accessed by registered users on presentation of a password, and so on. 
Logs of user activity are kept that record all queries made to every Greenstone 
collection (though this facility can be disabled). 

Although primarily designed for Internet access over the World-Wide Web, 
collections can be made available, in precisely the same form, on CD-ROM. In 
either case they are accessed through any Web browser. Greenstone CD-ROMs 
operate on a standalone PC under Windows 3.X, 95, 98, and NT, and the inter- 
action is identical to accessing the collection on the Web — except that response 
is faster and more predictable. The requirement to operate on early Windows 
systems is a significant practical impediment to the software design, but is crucial 
for many users — particularly those in underdeveloped countries seeking access 
to humanitarian aid collections. If the PC is connected to a network (intranet 
or Internet), a custom-built Web server provided on each CD makes exactly the 
same information available to others through their standard Web browser. The 
use of compression ensures that the greatest possible volume of information can 
be packed on to a CD-ROM. 

The collection-serving software operates under Unix and Windows NT, and 
works with standard Web servers. A flexible process structure allows different 
collections to be served by different computers, yet be presented to the user in 
the same way, on the same Web page, as part of the same digital library m- 






Existing collections can be updated and new ones brought on-line at any time, 
without bringing the system down; the process responsible for the user interface 
will notice (through periodic polling) when new collections appear and add them 
to the list presented to the user. 
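
A minimal sketch of such polling is given below. It assumes, purely for illustration, that each collection lives in its own subdirectory of a common root; the paper does not say how the interface process actually detects collections, and the function names are hypothetical.

```python
import os

def poll_collections(root, known):
    """Return (added, removed) collection names relative to the `known` set,
    treating each subdirectory of `root` as one collection (an assumption)."""
    current = {name for name in os.listdir(root)
               if os.path.isdir(os.path.join(root, name))}
    return current - known, known - current

def refresh(root, known):
    """One polling step: update `known` in place and report what was added."""
    added, removed = poll_collections(root, known)
    known |= added
    known -= removed
    return sorted(added)
```

Calling `refresh` on a timer keeps the list shown to users current without restarting the server.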

4 Browsing in the Digital Library of the Future 

Current digital library systems often contain handcrafted indexes and links to 
provide different entry points into the information, and to bind it together into 
a coherent whole. This can produce high-quality, focused collections — but it is 
basically unscalable. Excellent new material will, of course, continue to be pro- 
duced using manual techniques, but it is infeasible to suppose that the mass of 
existing, archival material will be manually “converted” into high-quality digi- 
tal collections. The only scalable solution that is used currently for amorphous 
information collections is the ubiquitous search engine — but browsing is poorly 
supported by standard search engines. They operate at the wrong level, indexing 
words whereas people think in terms of topics, and returning individual docu- 
ments whereas people often seek a more global view. 

Suppose you are browsing a large collection of information such as a digital 
library — or a large Web site. Searching is easy, if you know what you are looking 
for — and can express it as a query at the lexical level. But current search mecha- 
nisms are not much use if you are not looking for a specific piece of information, 
but are generally exploring the collection. Studies of browsing have shown that it 
is a rich and fundamental human information behavior, a multifaceted and mul- 
tidimensional human activity jS]. But it is not well-supported for large digital 
collections. 

We look at three browsing interfaces that capitalize on automatically-gener- 
ated phrases and keyphrases for a document collection. The first of these is a 
phrase-based browser that is specifically designed to support subject-index-style 
browsing of large information collections. The second uses more specific phrases 
and concentrates on making it convenient to browse closely-related documents. 
The third is a workbench that facilitates skimming, reading, and writing docu- 
ments within a digital library — a qualitatively different experience from working 
in a library today. All three are based on phrases and keyphrases extracted 
automatically from the document text itself. 

4.1 Emulating Subject Indexes: Phrase Browsing 

Phrases extracted automatically from a large information collection form an 
excellent basis for browsing and accessing it. We have developed a phrase-based 
browser that acts as an interactive interface to a phrase hierarchy that has been
extracted automatically from the full text of a document collection. It is designed 
to resemble a paper-based subject index or thesaurus. 

We illustrate the application of this scheme to a large Web site: that of the
United Nations Food and Agriculture Organization (www.fao.org), an
international organization whose mandate is to raise levels of nutrition and
standards of living, to improve agricultural productivity, and to better the
condition of rural populations. The site contains 21,700 Web pages, as well as
around 13,700 associated files (image files, PDFs, etc.). This corresponds to a
medium-sized collection of approximately 140 million words of text. It exhibits
many problems common to large, public Web sites. It has existed for some time,
is large and continues to grow rapidly. Despite strenuous efforts to organize it,
it is becoming increasingly hard to find information. A search mechanism is in
place, but while this allows some specific questions to be answered it does not
really address the needs of the user who wishes to browse in a less directed
manner.

Fig. 4. (a) Browsing for information about forest, (b) Expanding on forest products

Figure 4a shows the phrase browsing interface in use. The user enters an
initial word in the search box at the top. On pressing the Search button the
upper panel appears. This shows the phrases at the top level in the hierarchy
that contain the search word, in this case the word forest. The list is sorted by
phrase frequency; on the right is the number of times the phrase appears, and 
to the left of that is the number of documents in which it appears. 

Only the first ten phrases are shown, because it is impractical with a Web 
interface to download a large number of phrases, and many of these phrase 
lists are very large. At the end of the list is an item that reads Get more phrases 
(displayed in a distinctive color); clicking this will download another ten phrases, 
and so on. A scroll bar appears to the right for use when more than ten phrases 
are displayed. The number of phrases appears above the list: in this case there 
are 493 top-level phrases that contain the term forest. 
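
The behaviour of the upper panel can be sketched in a few lines. The data layout (a mapping from phrase to document count and occurrence count) and the function name are assumptions for illustration only, not the browser's actual data structures.

```python
def phrases_containing(word, phrase_stats, shown=10):
    """Return (total, page) for phrases that contain `word` as a whole word.

    `phrase_stats` maps phrase -> (document_count, occurrence_count). The page
    holds the `shown` most frequent matching phrases; asking for more phrases
    corresponds to calling again with `shown` increased by ten.
    """
    matches = [(phrase, docs, freq)
               for phrase, (docs, freq) in phrase_stats.items()
               if word in phrase.split()]
    # Most frequent first; ties broken alphabetically for a stable display.
    matches.sort(key=lambda m: (-m[2], m[0]))
    return len(matches), matches[:shown]
```

Returning the total alongside the current page lets the interface display the overall count above the list while downloading only ten phrases at a time.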

So far we have only described the upper of the two panels in Figure 4a. The
lower one appears as soon as the user clicks one of the phrases in the upper list. In 
this case the user has clicked forest products (that is why that line is highlighted 
in the upper panel) and the lower panel, which shows phrases containing the 
text forest products, has appeared. 

If one continues to descend through the phrase hierarchy, eventually the
leaves will be reached. A leaf corresponds to a phrase that occurs in only one 
document of the collection (though the phrase may appear several times in that 
document). In this case, the text above the lower panel shows that the phrase 
forest products appears in 72 phrases (the first ten are shown), and, in addition, 
appears in a unique context in 382 documents. The first ten of these are available 
too, though the list must be scrolled down to make them appear in the visible 
part of the panel. Figure 4b shows this. In effect, the panel shows a phrase
list followed by a document list. Either of these lists may be null (in fact the 
document list is null in the upper panel, because the word forest appears only 
in other phrases or in individual unique contexts). The document list displays 
the titles of the documents. 

It is possible, in both panels of Figures 4a and 4b, to click Get more phrases
to increase the number of phrases that are shown in the list of phrases. It is also 
possible, in the lower panels, to click Get more documents (again it is displayed 
at the end of the list in a distinctive color, but to see that entry it is necessary 
to scroll the panel down a little more) which increases the number of documents 
that are shown in the list of documents. 

Clicking on a phrase will expand it. The page holds only two panels, and if 
a phrase in the lower panel is clicked the contents of that panel move up into 
the top one to make space for the phrase’s expansion. Alternatively, clicking 
on a document will open that document in a new window. In fact, the user in
Figure 4b has clicked on IV FORESTS AND TRADE AND THE ENVIRONMENT,
and this brings up a Web page with that title. As the figure indicates,
that page contains 15 occurrences of the phrase forest products.

We have experimented with several different ways of creating a phrase hi- 
erarchy from a document collection. An algorithm called Sequitur builds a 
hierarchical structure containing every single phrase that occurs more than once 
in the document collection m We have also worked on a scheme called Kea 
which extracts keyphrases from scientific papers. This produces a far smaller, 
controllable, number of phrases per document jSj. The scheme that we use for 
the interface in Figure 4 is an amalgam of the two techniques H31-
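
To give a feel for what "every phrase that occurs more than once" means, the sketch below finds repeated word sequences by brute-force n-gram counting. It is explicitly not Sequitur, which infers a hierarchical grammar from the text; the function name and the `max_len` cutoff are assumptions made for this example.

```python
from collections import Counter

def repeated_phrases(text, max_len=3):
    """Return word n-grams (2..max_len words) that occur more than once,
    mapped to their occurrence counts."""
    words = text.lower().split()
    counts = Counter(
        " ".join(words[i:i + n])
        for n in range(2, max_len + 1)
        for i in range(len(words) - n + 1)
    )
    return {phrase: c for phrase, c in counts.items() if c > 1}
```

Even this naive version shows why the number of candidate phrases grows quickly with collection size, and why a more selective scheme such as keyphrase extraction is attractive.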

The phrases extracted represent the topics present in the Food and Agriculture
Organization site, as described in the terminology of the document authors.
We have investigated how well this set of phrases matches the standard termi- 
nology of the discipline by comparing the extracted phrases with phrases used by 
the AGROVOC agricultural thesaurus. There is a substantial degree of overlap 
between the two sets of phrases, which provides some confirmation of the quality 
of the extracted phrases as subject descriptors. 
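
One simple way to quantify such an overlap is the fraction of extracted phrases that also occur in the thesaurus. The paper does not state which measure was used, so the definition below (exact match after case-folding) and the function name are assumptions for illustration.

```python
def thesaurus_overlap(extracted, thesaurus):
    """Fraction of extracted phrases that also appear in the thesaurus,
    compared case-insensitively. Returns 0.0 for an empty extraction."""
    extracted = {p.lower() for p in extracted}
    thesaurus = {p.lower() for p in thesaurus}
    if not extracted:
        return 0.0
    return len(extracted & thesaurus) / len(extracted)
```

A stricter or looser comparison (for example, after stemming) would give different figures, so any reported overlap should state the matching rule used.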



4.2 Improved Browsing Using Keyphrase Indexes 

Another new kind of search interface that is explicitly designed to support
browsing is based on keyphrases automatically extracted from the documents p] 
using the Kea system [5]. A far smaller number of phrases are selected than in
the system described above: only four or five per document. The
automatically-extracted keyphrases form the basic unit of both indexing and
presentation, allowing users to interact with the collection at the level of topics and subjects
rather than words and documents. The system displays the topics in the collec- 
tion, indicates coverage in each area, and shows all ways a query can be extended 
and still match documents. 

The interface is shown in Figure 5. A user initiates a query by typing words
or phrases and pressing the Search button, just as with other search engines. 
However, what is returned is not a list of documents, but a list of keyphrases 
containing the query terms. Since all phrases in the database are extracted 
from the source documents, every returned phrase represents one or more doc- 
uments in the collection. Searching on the word text, for example, returns a list 
of phrases including text editor (a keyphrase for twelve documents), text com- 
pression (eleven documents), and text retrieval (ten documents), as shown in 
Figure 5. The phrase list provides a high-level view of the topics represented in
the collection, and indicates, by the number of documents, the coverage of each 
topic. 
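
The core of a keyphrase index is small enough to sketch. The keyphrases are taken as given here (in the system described they come from Kea); the function names and the data layout are assumptions for illustration.

```python
from collections import defaultdict

def build_keyphrase_index(doc_keyphrases):
    """Invert doc -> keyphrases into keyphrase -> set of documents."""
    index = defaultdict(set)
    for doc_id, phrases in doc_keyphrases.items():
        for phrase in phrases:
            index[phrase.lower()].add(doc_id)
    return index

def query_keyphrases(index, query):
    """Return (keyphrase, document_count) pairs for keyphrases containing
    every query word, most widely used first."""
    terms = query.lower().split()
    hits = [(phrase, len(docs)) for phrase, docs in index.items()
            if all(t in phrase.split() for t in terms)]
    return sorted(hits, key=lambda h: (-h[1], h[0]))
```

Because every returned phrase comes from the index itself, selecting one always leads to at least one document, and the count tells the user how many to expect.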

Following the initial query, a user may choose to refine the search using one 
of the phrases in the list, or examine a topic more closely. Since they are derived 
from the collection itself, any further search with these phrases is guaranteed to 
produce results; furthermore, the user knows exactly how many documents
to expect. To examine the documents associated with a phrase, the user selects it 
from the list, and previews of documents for which it is a keyphrase are displayed 
in the lower panel of the interface. Selecting any preview shows the document’s 
full text. 

Experiments with users show that this interface is superior to a traditional 
search system for answering particular kinds of questions: evaluating collections 
(“what’s in this collection”), exploring areas (“what subtopics are available in 
area X”), and general information about queries (“what kind of queries will 
succeed in area X”, “how can I specialize or generalize my query”). Note that 
many of these questions are as relevant to librarians as they are to library users. 
However, this mechanism is not intended to replace conventional search systems 
for specific queries about specific documents. 



4.3 Reading and Writing in a Digital Library 

A third prototype system, developed by Jones |Z|, shows how phrases can assist 
with skimming, reading, and writing documents in the digital library. It uses 
the keyphrases extracted from a document collection as link anchors to point to 
other documents. When reading a document, the keyphrases in it are highlighted. 
When writing one, phrases are dynamically linked, and highlighted, as you type. 

Figure 6 shows the interface. To the left is the document being examined
(read or authored); in the center is the keyphrase pane; and to the right is the 
library access pane. Keyphrases that appear in documents in the collection are 
highlighted; this facilitates rapid skimming of the content because the darker text 
points out items that users often highlight manually with a marker pen. Different 
gray levels reflect the “relevance” of the keyphrase to the document, and the user 
can control the intensity to match how they skim. Each phrase is hyperlinked, 
using multiple-destination links, to other documents for which it is a keyphrase 
(the anchor is the small spot that follows the phrase). The center panel shows 
all the keyphrases that appear in this document, with their frequency and the 
number of documents in the library for which they are keyphrases. Controls are 
available to sort the list in various different ways. Some of these phrases have 
been selected by the user, and on the right is a ranked list of items in the library 
that contain them as keyphrases — ranked according to a special metric designed 
for use with keyphrases. 
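
The highlighting and linking step can be approximated as below. Plain-text `[[...]]` markers stand in for visual highlighting, and the number in parentheses stands in for the multiple-destination link; the function name and the requirement that dictionary keys be lower-case are assumptions of this sketch, not features of the prototype.

```python
import re

def highlight(text, keyphrase_docs):
    """Wrap each known keyphrase in [[...]] and append the number of documents
    it links to. `keyphrase_docs` maps lower-case keyphrase -> list of document
    ids; longer phrases are tried first so they win over their substrings."""
    if not keyphrase_docs:
        return text
    phrases = sorted(keyphrase_docs, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )

    def mark(match):
        targets = keyphrase_docs[match.group(1).lower()]
        return f"[[{match.group(1)}]]({len(targets)})"

    return pattern.sub(mark, text)
```

Running the same function on each keystroke-updated draft would give the "linked as you type" behaviour, at least for documents small enough to rescan.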

With this interface, hurried readers can skim a document by looking at the 
highlighted phrases. In-depth readers can instantly access other relevant doc- 
uments (including dictionaries or encyclopaedias). They can select a subset of 
relevant phrases and instantly have the library searched on that set. Writers — as 
they type — can immediately gain access to documents that are relevant to what 
they are writing. 

Fig. 6. Working on a paper inside the digital library



5 Conclusion 

Digital libraries have finally arrived. They are different from the World Wide 
Web: libraries are focused collections, and it is the act of selection that gives them 
focus. For many practical reasons (including copyright, and the physical difficulty 
of digitization), digital libraries will not vie with archival national collections, 
not in the foreseeable future. Their role is in specialist, targeted collections of 
information. 

Established libraries of printed material have sophisticated and well-develop- 
ed human and computer-based interfaces to support their use. But they are not 
well integrated for working with computer tools: a bridging process is required. 
Information workers can immerse themselves physically in the library, but they 
cannot take with them their tasks, tools, and desktop workspaces. The digital 
library will be different: we will work “inside” it in a sense that is totally new.

But even for a focused collection, creating a high-quality digital library is 
a highly labor-intensive process. To provide the richness of access and inter- 
connection that makes a digital library comfortable requires enormous editorial 
effort. And when the collection changes, maintenance becomes an overriding 
issue. Fortunately, techniques of text mining are emerging that offer the possi- 
bility of automatic identification of semantic items from plain text. Carefully- 
constructed user interfaces can take advantage of the information that they 
generate to provide a library experience that is qualitatively different from a 
physical library — not just in access and convenience, but in terms of the quality 
of browsing and information accessibility. Tomorrow, digital libraries will put 
the right information at your fingertips. 

Abstract. Speech recognition is an area with a sizable literature, but
there is little discussion of the topic within the computer science algo- 
rithms community. Since many of the problems arising in speech recog- 
nition are well suited for algorithmic studies, we present them in terms 
familiar to algorithm designers. Such cross fertilization can breed fresh 
insights from new perspectives. 

This material is abstracted from A. L. Buchsbaum and R. Giancarlo,
Algorithmic Aspects of Speech Recognition: An Introduction, ACM Journal
of Experimental Algorithmics, Vol. 2, 1997, http://www.jea.acm.org



1 Introduction 

Automatic recognition of human speech by computers (ASR, for short) has been 
an area of scientific investigation for over forty years. (See, for instance, Waibel
and Lee.) Its nature is inherently interdisciplinary, because it involves
expertise and knowledge coming from such diverse areas as signal processing, artificial
intelligence, statistics, and natural languages. Intuitively, one can see the core 
computational problems in ASR in terms of searching large, weighted, spaces. 
They therefore naturally lend themselves to the formulation of algorithmic ques- 
tions. Historically, however, advances in ASR and in design of algorithms have 
found different places in the literature and in fact mostly address separate audi- 
ences (with a few exceptions Q). This is unfortunate, because ASR challenges 
algorithm designers to find solutions that are asymptotically efficient and also 
practical in real cases. 

The general problem areas that are involved in ASR — in particular, graph 
searching and automata manipulation — are well known to and have been exten- 
sively studied by algorithms experts, sometimes resulting in very tight theoreti- 
cal bounds and even good practical implementations (e.g., shortest path finding 

* Part of this work was done while the author was an MTS at AT&T Bell Labs and 
continued while visiting AT&T Labs. Part of the author’s research is supported 
by the Italian Ministry of Scientific Research, Project “Bioinformatica e Ricerca 
Genomica.” 
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and finite state automata minimization). The manifestations of these problems 
in ASR, however, are so large as to defy those solutions. The result is that most 
of the progress in speech recognition to date is due to clever heuristic methods 
that solve special cases of the general problems. Good characterizations of these 
special cases, as well as theoretical studies of their solutions, remain for the most 
part lacking. 

While ASR is well suited for exploration and experimentation by algorithm 
theorists and designers, there are obvious obstacles in learning the formalisms 
and the design and experimental methodologies developed by the ASR community. 
The aim of this abstract is to present, at a very high level, a few research 
areas in ASR, extracted from a sizable body of literature in that area. This work 
abstracts an earlier paper, in which we present a more detailed view of ASR 
from an algorithmic perspective. Here we give only a high-level overview of ASR 
and some relevant problems with a technical formalism familiar to algorithm 
designers. We refer the reader to the previous paper for a more detailed 
presentation of these research areas, as well as a proper identification of the context 
from which they have been abstracted. 

2 Algorithmic Research Areas in ASR 

We present ASR in terms of the maximum likelihood paradigm, which has 
become dominant in the area. Fig. 1 gives a block diagram and a short description 
of each module of a speech recognizer. Intuitively, one can think of a lattice as a 
data structure representing a set of strings. 

Hidden Markov models (HMMs, for short) are basic building blocks of ASR 
systems. HMMs are most commonly used to model phones, which are the basic 
units of sound to be recognized. For instance, they are the models underlying 
the acoustic-phonetic recognizer of Fig. 1. 

A great deal of research has been invested in good algorithms for training the 
models and for their subsequent use in ASR systems. In abstract terms, an HMM 
is a finite state machine that takes as input a string and produces a probability 
for that string matching the model. The probability gives the likelihood that 
the input string actually belongs to the set of strings on which the HMM has 
been trained. Every path through the machine contributes to the probability 
assigned to an input string. The forward procedure, a dynamic programming 
algorithm, computes that probability in time proportional to the product of 
the length of the string and the size of the HMM. In algorithmic terms, the 
two parameters of interest here are the speed of the algorithm performing the 
“matching” and the size of the model. Given that the HMMs used in ASR have 
a very special topology, it is quite natural to consider the two following areas. 
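
The forward procedure described above can be sketched as follows. This is a minimal illustration, not the formulation of any particular ASR system: the function name, the dense list-based representation of the model, and the two-state example model are all assumptions made here for the example. With a dense transition table the running time is proportional to the string length times the number of transitions.

```python
def forward_probability(obs, init, trans, emit):
    """Sum, over all state paths, of the probability of emitting `obs`.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][c]  : probability that state s emits symbol c
    """
    states = range(len(init))
    if not obs:
        return 1.0  # the empty string is generated with probability 1
    # alpha[s] = probability of emitting the symbols read so far and ending in state s
    alpha = [init[s] * emit[s][obs[0]] for s in states]
    for symbol in obs[1:]:
        alpha = [
            emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        ]
    return sum(alpha)


# Hypothetical two-state model over the alphabet {a, b} (illustrative values only).
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [{"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]

print(forward_probability("ab", INIT, TRANS, EMIT))
```

Only one vector of partial probabilities is kept at a time, so the extra space is linear in the number of states. A sparse transition structure, such as the special topologies mentioned above, would replace the inner sum over all states by a sum over actual predecessors.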

Research Area 1. Devise faster methods to compute the probability that an 
HMM matches a given input string. In particular, can the topology of the HMM 
be exploited towards this end? 
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Fig. 1. Block diagram of a speech recognizer. Input speech is digitized into a 
sequence of feature vectors. An acoustic-phonetic recognizer transforms the fea- 
ture vectors into a time-sequenced lattice of phones. A word recognition module 
transforms the phone lattice into a word lattice, with the help of a lexicon. Fi- 
nally, in the case of continuous or connected word recognition, a grammar is 
applied to pick the most likely sequence of words from the word lattice. 



Research Area 2. Devise algorithms to reduce the size of an HMM. This is 
analogous to the determinization and minimization problems on finite-state au- 
tomata, which will be discussed later. 

A simple way to speed the estimation of the probability of a string matching 
the model is to resort to an approximation: find the most probable path in the 
model that generates the string (rather than the sum of the probabilities of all the 
matching paths). One can find the desired path using the Viterbi algorithm, 
which computes a dynamic programming recurrence referred to as the Viterbi 
equation. Close examination reveals that the Viterbi equation is a variant of 
the Bellman-Ford equations for computing shortest paths in weighted graphs, 
in which the weight of an edge is a function of time. That is, the weight 
depends on the time instant (the position being matched in the string) in which 
we are actually using the edge. For this latter class of shortest path problems, 
not much is known. 
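
As a rough sketch of the approximation just described, the code below computes the single most probable state path instead of summing over all paths. The function name, the list-based model representation, and the example model are assumptions for illustration. Taking negative logarithms of the probabilities would turn the maximization into a shortest-path computation whose edge weights depend on the position in the string.

```python
def viterbi(obs, init, trans, emit):
    """Return (probability, state path) of the most probable path emitting `obs`."""
    states = range(len(init))
    if not obs:
        return 1.0, []
    # best[s] = probability of the best path that emits the prefix read so far and ends in s
    best = [init[s] * emit[s][obs[0]] for s in states]
    back = []  # back[t][s] = predecessor of state s on the best path at step t + 1
    for symbol in obs[1:]:
        prev = best
        pointers = [max(states, key=lambda r: prev[r] * trans[r][s]) for s in states]
        best = [prev[pointers[s]] * trans[pointers[s]][s] * emit[s][symbol] for s in states]
        back.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Hypothetical two-state model over the alphabet {a, b} (illustrative values only).
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [{"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]

probability, path = viterbi("ab", INIT, TRANS, EMIT)
print(probability, path)
```

The structure is the same as in the forward procedure with the sum replaced by a maximum, which is why the two share the same asymptotic cost.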

Research Area 3. Devise faster algorithms to solve the Viterbi equation. As 
with Research Area 1, investigate how to characterize and exploit the particular 
graph topologies that arise in speech recognition. 
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Another avenue of research is to compute iteratively better approximations 
to the final values of the Viterbi equation, analogously to the scaling algorithms 
used for standard shortest path problems. 

Research Area 4. Devise an analogue of the scaling technique that would apply 
to the computation of the Viterbi equation. 

The models describing the ASR process tend to be large and highly structured. 
One can exploit that structure by using heuristics to prune part of the 
search space. In these settings, the A* algorithm from artificial intelligence is 
often used. We are now at the lexicon or grammar level shown in Fig. 1. In 
general, the A* algorithm will find an “optimal solution” only when the heuristic 
used is admissible. Unfortunately, most of the heuristics used in ASR cede 
admissibility for speed. No analytic guarantees on the quality of the solutions 
are available. 

Research Area 5. Investigate the potential for admissible heuristics that will 
significantly speed computation of the A* algorithm, or determine how to measure 
theoretically the error rates of fast but inadmissible heuristics. 

Pereira et al. have recently devised a new paradigm to describe the 
entire ASR process. They formalize it in terms of weighted transductions. In fact, 
one can see the entire process in Fig. 1 as a cascade of translations of strings 
in one “language” into strings of another. This new paradigm is far reaching: 
it gives (a) a modular approach to the design of the various components of an 
ASR system; and (b) a solid theoretic foundation to which to attach many of 
the problems that seemed solvable only by heuristics. Moreover, it has relevant 
impact on automata and transduction theory. Essential to this new paradigm 
are the problems of determinizing and minimizing weighted automata. Despite 
the vast body of knowledge available in formal languages and automata theory, 
those two fundamental problems were open until Mohri solved them. As 
one might guess, determinization of a weighted automaton is a process that can 
take at least exponential time and produce a similar increase in the number of 
states of the input automaton. It is remarkable that, when the determinization 
algorithm by Mohri is applied to weighted automata coming from ASR, the 
resulting deterministic automata are, in most cases, smaller than their non- 
deterministic inputs. Given the importance of reducing the size of the automata 
produced in ASR, the following two areas are extremely important from the 
practical point of view and also challenging from the theoretical point of view. 

Research Area 6. Characterize the essential properties of sequential weighted 
automata that permit efficient determinization. 

We separately report partial progress in Research Area 6, resulting in an 
approximation strategy that increases the space reduction of Mohri’s algorithm 
when applied to weighted automata from ASR. 
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Research Area 7. Unify the results of Mohri and Breslauer (provably 
optimal minimization for sequential transducers, a special class of transducers) 
with those of Roche (minimization of transducers without proven asymptotic 
size reductions, but with good practical performance). 

In fact, minimization of transducers in general is an open area of research, 
although partial progress towards a theoretical foundation exists. 

3 Sources of Code and Data 

One important aspect of designing algorithms for ASR is experimentation. To 
be accepted, new algorithms must be compared to a set of standard bench- 
marks. Here we give a limited set of pointers to find code and data for proper 
assessments; we refer the reader to the full paper for additional sources. 

Commercially available speech recognition products are typically designed for 
use by applications developers rather than researchers. One product, though, 
called HTK, provides a more low-level toolkit for experimenting with speech 
recognition algorithms in addition to an application-building interface. It is 
available from Entropic Research Lab, Inc., at http://www.entropic.com/htk/ 

To obtain the finite-state toolkit developed by Pereira et al., one can send 
mail to fsm@research.att.com. 

A large variety of speech and text corpora is available from the Linguistic 
Data Consortium (http://www.ldc.upenn.edu/). This service is not free, 
although membership in LDC reduces the cost per item. The following are some 
of the commonly used speech corpora available. 

TIMIT Acoustic-Phonetic Continuous Speech Corpora. 

RM Resource Management Corpora. 

ATIS Air Travel Information System. 

CSR Continuous Speech Recognition. 

SWITCHBOARD Switchboard Corpus of Recorded Telephone Conversations. 

For the multilingual experimenter, there is the Oxford Acoustic Phonetic 
Database on CDROM. It is a set of two CDs that contain digitized recordings 
of isolated lexical items plus isolated monophthongs from each of the following 
eight languages/dialects: American English, British English, French, German, 
Hungarian, Italian, Japanese, and Spanish. 
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Abstract. Given an input sequence of data, a “rigid” pattern is a re- 
peating sequence, possibly interspersed with “dont care” characters. In 
practice, the patterns or motifs of interest are the ones that also allow 
a variable number of gaps (or “dont care” characters): we call these the 
flexible motifs. The number of rigid motifs could potentially be exponen- 
tial in the size of the input sequence and in the case where the input is 
a sequence of real numbers, there could be an uncountably infinite number 
of motifs (assuming two real numbers are equal if they are within some 
$\delta > 0$ of each other). It has been shown earlier that by suitably defining 
the notion of maximality and redundancy, there exists only a linear 
(or no more than 3n) number of irredundant motifs and a polynomial 
time algorithm to detect these irredundant motifs. Here we present a 
uniform framework that encompasses both rigid and flexible motifs with 
generalizations to sequences of sets and real numbers and show a somewhat 
surprising result that the number of irredundant flexible motifs still 
has a linear bound. However, the algorithm to detect them has a higher 
complexity than that of the rigid motifs. 



1 Introduction 

Given an input sequence of data, a “rigid” pattern is a repeating sequence, 
possibly interspersed with “dont care” characters. For example given a string 
s = abcdaXcdabbcd, m = a.cd is a pattern that occurs twice in the data at 
positions 1 and 5 in s. The data could be a sequence of characters or sets of 
characters or even real values. In practice, the patterns or motifs of interest are 
the ones that also allow a variable number of gaps (or “dont care” characters): we 
call these the flexible motifs. In the above example, the flexible motif would occur 
three times at positions 1, 5 and 9. At position 9 the dot character represents two 
gaps instead of one. The flexible pattern is an extension of the generalized regular 
pattern described in [BJEG98] in the sense that the input here could also be a 
sequence of real numbers. 
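
The example above can be checked mechanically. The sketch below is only an illustration: the helper names are hypothetical, positions are 1-based as in the text, and the flexible motif is modelled in a simplified way as a fixed head and tail separated by a variable number of "dont care" characters.

```python
def occurs_at(s, pattern, pos):
    """True if the rigid pattern ('.' is a dont-care) matches s at 1-based position pos."""
    start = pos - 1
    if start + len(pattern) > len(s):
        return False
    return all(p == "." or p == c for p, c in zip(pattern, s[start:]))


def rigid_occurrences(s, pattern):
    """All 1-based positions at which the rigid pattern occurs in s."""
    return [i for i in range(1, len(s) + 1) if occurs_at(s, pattern, i)]


def flexible_occurrences(s, head, tail, min_gap, max_gap):
    """Positions where head, then min_gap..max_gap dont-cares, then tail occurs."""
    return [
        i
        for i in range(1, len(s) + 1)
        if any(occurs_at(s, head + "." * g + tail, i) for g in range(min_gap, max_gap + 1))
    ]


s = "abcdaXcdabbcd"
print(rigid_occurrences(s, "a.cd"))               # rigid motif a.cd
print(flexible_occurrences(s, "a", "cd", 1, 2))   # one or two gaps between a and cd
```

On the string of the example, the rigid pattern is found at positions 1 and 5, and allowing one or two gaps adds position 9.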

Pattern discovery in biomolecular data has often been closely intertwined 
with the problem of alignment of sequences [ZZ96], [Alt89], [GL88], [HTHI95], 
[MSZM97], [LSB+93], [JCH95], [WCM+96], [FRP+99], [GYW+97], [RGFP99]. 
The reader may refer to [RFP+00] for a history in this regard: the 
paper traces the interesting journey from inception to the current research in 
pattern discovery.^1 [BJEG98] surveys approaches to the pattern discovery problem 
in biological applications and presents a systematic formal framework that 
enables an easy comparison of the power of the various algorithms and software 
systems available: the focus is on the different algorithms and the class of patterns being 
handled. There has also been work in the literature that deals primarily with pattern 
discovery [WCM+94]. A very closely related class of problems is that of data 
mining and rule discovery [AS95], [AMS+95], [Bay92]. 
However, in this paper we intend to take a different route to understanding pattern 
discovery in the context of any application: be it in biomolecular data, market 
survey data, English usage, system log data, etc. The reader should bear in 
mind that the definitions of patterns/motifs emerge from the actual usage and 
the contexts in which they are used. 

The task of discovering patterns must be clearly distinguished from that of 
matching a given pattern in a database. In the latter situation we know what we 
are looking for, while in the former we do not know what is being sought. Hence 
a pattern discovery algorithm must report all patterns. The total number of rigid 
motifs could potentially be exponential in the size of the input sequence and in 
the case where the input is a sequence of real numbers, there could be an 
uncountably infinite number of motifs (assuming two real numbers are equal if they are 
within some $\delta > 0$ of each other). It has been shown earlier [Par99, PRF+00] that 
by suitably defining the notion of maximality and redundancy, there exists only 
a linear (or no more than 3n) number of irredundant motifs. This is meaningful 
also from an algorithmic viewpoint, since a polynomial time algorithm has been 
presented to detect these irredundant motifs. This family of irredundant motifs 
is also very characteristic of the family of all the motifs: in applications such 
as multiple sequence alignment, it has been shown that the irredundant motifs 
suffice to obtain the alignment. This bound on the number of 
useful motifs gives validation to motif-based approaches, since the total number 
of irredundant motifs does not explode. This result is of significance to most 
applications that use pattern discovery as the basic engine such as data mining, 
clustering and matching. 

Flexible motifs obviously capture more information than rigid motifs and 
have been found to be very useful in different applications [BJEG98]. Here we 
use a uniform framework that encompasses both rigid and flexible motifs. We 
give some very natural definitions of maximality and redundancy for the flexible 
motifs and show a somewhat surprising result that the number of irredundant 
flexible motifs is still bounded by 3n. However, the algorithm to detect them has 
a higher complexity than that of the rigid motifs. 

In most of the previous work mentioned above, the discovery process is closely 
tied with the applications. Here we separate the two: this is a useful viewpoint 
since it allows for a better understanding of the problem and helps capture the 
commonality in the different applications. In almost all applications, the size of 



^1 The reader may also visit http://www.research.ibm.com/bioinformatics for some 
current work in this area. 
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the data is very large which justifies the task of automatic or unaided pattern 
discovery. Since it is very hard to verify if all the patterns have been discovered, 
it becomes all the more important to assure that all patterns (depending on the 
definition) have been found. As the definition of a pattern becomes relaxed, the 
total number of candidate patterns increases substantially. For instance, the total 
number of rigid patterns with no “dont care” characters can be no more than n 
on an input of size n. However, if “dont care” characters are allowed the number 
of motifs is exponential in n. The next task, which is application specific, is to 
prune this large set of patterns by a certain boolean criterion $\mathcal{C}$. Usually $\mathcal{C}$ is based 
on a real valued function, say $C$, and $\mathcal{C}(m)$ = TRUE if and only if $C(m) > t$, for some 
threshold $t$. If $\mathcal{C}(m)$ = TRUE, motif $m$ is of interest and if $\mathcal{C}(m)$ = FALSE, $m$ 
need not be reported. $C$ is a measure of significance depending on the application. 
There are at least two issues that need to be recognized here: one is a lack of 
proper understanding (or a consensus amongst researchers) of the domain to 
define an appropriate $C$, and the other is that even if $C$ is overwhelmingly acceptable, 
the guarantee that all patterns $m$ with $\mathcal{C}(m)$ = TRUE have been detected is 
seldom provided. The former issue is a debate that is harder to resolve and the 
latter exists because of the inherent difficulty of the models (function $C$). 

In our study of the problem in the following paragraphs, we will assume 
that to apply $\mathcal{C}$, we have all the patterns at our disposal. We will focus on the 
task of trying to obtain the patterns in a modelless manner. However, $\mathcal{C}$ may be 
applied in conjunction with our algorithm to obtain only those $m$'s that satisfy 
$\mathcal{C}(m)$ = TRUE. 

In the rest of the paper the terms pattern and motif will be used interchange- 
ably. 



2 Basics 

Let $s$ be a sequence on an alphabet $\Sigma$, `.' $\notin \Sigma$. A character from $\Sigma$, say $\sigma$, is 
called a solid character and `.' is called a “dont care” character. For brevity of 
notation, if $x$ is a sequence, then $|x|$ denotes the length of the sequence and if $x$ is 
a set of elements then $|x|$ denotes the cardinality of the set. The $j$th ($1 \le j \le |s|$) 
character of the sequence is given by $s[j]$. 

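Definition 1. ($\sigma_1 \prec, =, \preceq \sigma_2$) If $\sigma_1$ is a “dont care” character then $\sigma_1 \prec \sigma_2$. If 
both $\sigma_1$ and $\sigma_2$ are identical characters in $\Sigma$, then $\sigma_1 = \sigma_2$. If either $\sigma_1 \prec \sigma_2$ or 
$\sigma_1 = \sigma_2$ holds, then $\sigma_1 \preceq \sigma_2$. 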



Definition 2. (Annotated Dot Character, $.^{x}$) An annotated `.' character is 
written as $.^{x}$ where $x$ is a set of positive integers $\{x_1, x_2, \ldots, x_k\}$ or an interval 
$x = [x_l, x_u]$, representing all integers between $x_l$ and $x_u$ including $x_l$ and $x_u$. 

To avoid clutter, the annotation superscript x, in the rest of the paper, will be 
an integer interval. 
Definition 3. (Realization) Let $p$ be a string on $\Sigma$ and annotated dot characters. 
If $p'$ is a string obtained from $p$ by replacing each annotated dot character 
$.^{x}$ by $l$ dot characters where $l \in x$, $p'$ is a realization of $p$. 

For example, if $p = a.^{[3,6]}b.^{[2,5]}cde$, then $p' = a...b...cde$ is a realization of $p$ and 
so is $p'' = a...b.....cde$. 

Definition 4. ($p$ Occurs at $l$) A string $p$ on $\Sigma \cup \{.\}$ occurs at position $l$ on 
$s$ if $p[j] \preceq s[l + j - 1]$ holds for $1 \le j \le |p|$. A string $p$ on $\Sigma$ and annotated dot 
characters occurs at position $l$ in $s$ if there exists a realization $p'$ of $p$ that occurs 
at $l$. 

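If $p$ is flexible then $p$ could possibly occur multiple times at a location on a 
string $s$. For example, if $s = axbcbc$, then $p = a.^{[1,3]}bc$ occurs twice at position 1: 
once as $a.bc$ (with $bc$ matched at positions 3 and 4) and once as $a...bc$ (with $bc$ 
matched at positions 5 and 6). Let $\#_l$ denote the number of occurrences at location $l$. 
In this example $\#_1 = 2$. 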
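
Definitions 3 and 4 can be made concrete with a small sketch. The representation is an assumption made for this illustration: a motif is a list of tokens, each either a string of solid characters or a pair (lo, hi) standing for an annotated dot character with interval [lo, hi]. The test motif, an a followed by one to three gaps and then bc, is likewise an assumed example chosen to give two occurrences at position 1 of axbcbc.

```python
from itertools import product


def realizations(motif):
    """Yield every rigid string obtained by expanding each annotated dot (lo, hi)."""
    choices = []
    for token in motif:
        if isinstance(token, tuple):
            lo, hi = token
            choices.append(["." * k for k in range(lo, hi + 1)])
        else:
            choices.append([token])
    for parts in product(*choices):
        yield "".join(parts)


def matches_at(s, rigid, pos):
    """True if the rigid string ('.' is a dont-care) occurs in s at 1-based position pos."""
    start = pos - 1
    if start + len(rigid) > len(s):
        return False
    return all(p == "." or p == c for p, c in zip(rigid, s[start:]))


def occurrences_at(s, motif, pos):
    """Number of realizations of the flexible motif that occur in s at position pos."""
    return sum(1 for r in realizations(motif) if matches_at(s, r, pos))


motif = ["a", (1, 3), "bc"]   # assumed example: a, then 1 to 3 gaps, then bc
print(list(realizations(motif)))
print(occurrences_at("axbcbc", motif, 1))
```

Enumerating realizations is exponential in the number of annotated dot characters, so this is useful for checking small examples rather than as a discovery algorithm.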

Definition 5. (Motif $m$, Location List $\mathcal{L}_m$) Given a string $s$ on alphabet $\Sigma$ and 
a positive integer $k$, $k \le |s|$, a string $m$ on $\Sigma$ and annotated dot characters is a 
$k$-motif with location list $\mathcal{L}_m = (l_1, l_2, \ldots, l_p)$, if $m[1], m[|m|] \in \Sigma$^2 and $m$ occurs 
at each $l \in \mathcal{L}_m$ with $p \ge k$. 

If $m$ is a string on $\Sigma \cup \{.\}$, $m$ is called a rigid motif and if $m$ is a string on $\Sigma$ 
and annotated dot characters, where at least one annotation $x$ in $m$ represents 
more than one integer, then $m$ is called a flexible motif. 

Definition 6. ($m_1 \preceq m_2$) Given two motifs $m_1$ and $m_2$ with $|m_1| \le |m_2|$, 
$m_1 \preceq m_2$ holds if for every realization $m'_1$ of $m_1$ there exists a realization $m'_2$ 
of $m_2$ such that $m'_1[j] \preceq m'_2[j]$, $1 \le j \le |m'_1|$. 

For example, let $m_1 = AB..E$, $m_2 = AK..E$ and $m_3 = ABC.E.G$. Then $m_1 \preceq m_3$, 
and $m_2 \not\preceq m_3$. The following lemmas are straightforward to verify. 

Lemma 1. If $m_1 \preceq m_2$, then $\mathcal{L}_{m_1} \supseteq \mathcal{L}_{m_2}$. If $m_1 \preceq m_2$ and $m_2 \preceq m_3$, then 
$m_1 \preceq m_3$. 

Definition 7. (Sub-motifs of Motif $m$, $S^{[j_i, j_k]}(m)$) Given a motif $m$ let $m[j_1]$, 
$m[j_2]$, $\ldots$, $m[j_l]$ be the $l$ solid characters in the motif $m$. Then the sub-motifs of 
$m$ are given as $S^{[j_i, j_k]}(m)$, $1 \le i < k \le l$, which is obtained by dropping all the 
characters before (to the left of) $j_i$ and all characters after (to the right of) $j_k$ 
in $m$. 

Definition 8. (Maximal Motif) Let $p_1, p_2, \ldots, p_k$ be the motifs in a sequence 
$s$. Define $p_i[j]$ to be `.' if $j > |p_i|$. A motif $p_i$ is maximal in composition if and 
only if there exists no $p_l$, $l \ne i$, with $\mathcal{L}_{p_i} = \mathcal{L}_{p_l}$ and $p_i \preceq p_l$. A motif $p_i$, maximal 
in composition, is also maximal in length if and only if there exists no motif $p_j$, 
$j \ne i$, such that $p_i$ is a sub-motif of $p_j$ and $|\mathcal{L}_{p_i}| = |\mathcal{L}_{p_j}|$. A maximal motif 
is maximal both in composition and in length. 

^2 The first and last characters of the motif are solid characters; if “dont care” 
characters are allowed at the ends, the motifs can be made arbitrarily long in size without 
conveying any extra information. 



Some Results on Flexible-Pattern Discovery 






Bounding the Total Number of Multiple Occurrences. The maximum number 
of occurrences at a position is clearly bounded by $n$, thus the total number 
of occurrences in the location list is bounded by $n^2$. Is this bound actually 
attained? The following is an example to show that such a bound is achieved. 
Let $s$ be the input string that has $n/2$ a's followed by $n/2$ b's. Consider a motif 
of the form $m = a\,.^{x}\,b$, where the annotation $x$ admits all gap lengths up to $n$. 
At positions 1 to $n/2$, $m$ occurs $n/2$ times in each. Thus the 
total number of occurrences is clearly $\Omega(n^2)$. 

Can Density Constraint Reduce the Number of Motifs? Density could be specified 
by insisting that every realization of the motif has no more than d consecutive 
“dont care” characters. Is it possible to have a much smaller set of motifs 
under this definition (where d is a small constant)? We show with the following 
example that the density constraint does not affect the worst case situations. 
Let the input string s have the following form: 

a c1 c2 c3 b a X c2 c3 b Y a c1 X c3 b Y Y a c1 c2 X b 

Then the maximal motifs (which are 7 in number) are a...b, a..c3b, a.c2.b, 
ac1..b, a.c2c3b, ac1.c3b, ac1c2.b. Let d = 1. Consider the input string in the last 
example. We construct a new string by placing a new character Z between every 
two characters as follows: 

aZc1Zc2Zc3ZbaZXZc2Zc3ZbYaZc1ZXZc3ZbYYaZc1Zc2ZXZb 

The length of the string at most doubles, whereas the number of maximal 
motifs that have no more than one consecutive dot character is at least as many 
as it was before. 

See [Par98, Par99, PRF+00] for the motivation and a general description of 
the notion of redundancy. A formal definition is given below. 

Definition 9. (Redundant, Irredundant Motif) A maximal motif $m$, with location 
list $\mathcal{L}_m$, is redundant if there exist maximal motifs $m_i$, $1 \le i \le p$, $p \ge 1$, 
such that $\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \ldots \cup \mathcal{L}_{m_p}$. A maximal motif that is not redundant is 
called an irredundant motif. 

Notice that for a rigid motif $p > 1$ ($p$ in Definition 9), since each location list 
corresponds to a unique motif, whereas for a flexible motif $p$ could have a value 
1. For example, let $s = axfygbapgrfb$. Then $m_1 = a.^{[1,3]}f.^{[0,2]}b$, $m_2 = a.^{[1,3]}g.^{[0,2]}b$, 
$m_3 = a....b$ with $\mathcal{L}_{m_1} = \mathcal{L}_{m_2} = \mathcal{L}_{m_3} = \{1, 7\}$. But $m_3$ is redundant since 
$m_3 \preceq m_1, m_2$. Also $m_1 \not\preceq m_2$ and $m_2 \not\preceq m_1$, hence both are irredundant. 

Generating Operations. The redundant motifs need to be generated from the 
irredundant ones, if required. We define the following generating operations. Let 
$m$, $m_1$ and $m_2$ be motifs. 

Prefix operator, $P^{\delta}(m)$, $1 < \delta \le |m|$: This is a valid operation when $\delta$ is an 
integer and $m[\delta]$ is a solid character, since all the operations are closed under 
motifs. $P^{\delta}(m)$ is the string given by $m[1 \ldots \delta]$. For example, if $m = AB..CDE$, 
then $P^{3}(m)$ is not a valid operation since $m[3]$ is a dot-character. Also, $P^{5}(m) = 
AB..C$. 

Binary AND operator, mi ® m^- m = m\ ^ m2, where m is such that m ^ 
mi, m2 and there exists no m' with m < m' . For example if mi = 
and m2 = AB\^'^'^FG. Then, m = mi ^ m2 = A..\"^’'^^G. 

Binary OR operator, $m_1 \oplus m_2$: $m = m_1 \oplus m_2$, where $m$ is such that $m_1, m_2 
\preceq m$ and there exists no $m'$ with $m' \prec m$. For example, if $m_1 = A..D..G$ and 
$m_2 = AB...FG$, then $m = m_1 \oplus m_2 = AB.D.FG$. 

3 Bounding the Irredundant Flexible Motifs 

Definition 10. (Basis) Given a sequence $s$ on an alphabet $\Sigma$, let $\mathcal{M}$ be the set 
of all maximal motifs on $s$. A set of maximal motifs $B$ is called a basis of $\mathcal{M}$ iff 
the following hold: for each $m \in B$, $m$ is irredundant with respect to $B - \{m\}$, 
and, letting $G(X)$ be the set of all the redundant maximal motifs generated by the 
set of motifs $X$, $\mathcal{M} = G(B)$. 

In general, $|\mathcal{M}| = \Omega(2^n)$. The natural attempt now is to obtain as small a 
basis as possible. 

Theorem 1. Let $s$ be a string with $n = |s|$ and let $B$ be a basis or a set of 
irredundant motifs. Then $|B| \le 3n$. 

Proof. A proof for the special case when all the motifs are rigid, i.e., defined 
on $\Sigma \cup \{.\}$, appeared in [Par99, PRF+00]. The framework has been extended to 
incorporate the flexible motifs. Consider $B^*$ ($B \subset B^*$) where the motifs in $B^*$ are 
not maximal and redundant. Every position $x \in \mathcal{L}_{m_1}, \mathcal{L}_{m_2}, \ldots, \mathcal{L}_{m_l}$ is assigned 
ON/OFF with respect to $m_i$ as follows: if $m_i$, $1 \le i \le l$, is such that there exists 
no $j \ne i$, $1 \le j \le l$, so that $m_i \preceq m_j$ holds, then $x$ is marked ON, otherwise it 
is marked OFF w.r.t. $m_i$. Further, if $\mathcal{L}_{m_1} = \mathcal{L}_{m_2} = \ldots = \mathcal{L}_{m_l}$, then the position 
is marked ON w.r.t. each of $m_i$, $1 \le i \le l$. We make the following claims due to 
this ON/OFF marking: 

Claim. At the end of this step, every motif $m$ that is not redundant has at least 
one location $x \in \mathcal{L}_m$ marked ON w.r.t. $m$. 

Proof. This follows directly from the definition of redundancy. Also a redundant 
motif $m$ may not have any position marked ON w.r.t. $m$. □ 

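Definition 11. ($\mathcal{L}_a$ Straddles $\mathcal{L}_b$) A set $\mathcal{L}_a$ straddles a set $\mathcal{L}_b$ if $\mathcal{L}_a \cap \mathcal{L}_b \ne \emptyset$, 
$\mathcal{L}_a - \mathcal{L}_b \ne \emptyset$ and $\mathcal{L}_b - \mathcal{L}_a \ne \emptyset$. 

Notice that if $\mathcal{L}_a$ straddles $\mathcal{L}_b$, then $\mathcal{L}_b$ straddles $\mathcal{L}_a$. 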
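
The straddling condition of Definition 11 translates directly into set operations. The function name below is a hypothetical choice for this sketch, and location lists are represented as Python sets of positions.

```python
def straddles(list_a, list_b):
    """True if the two location lists overlap and neither contains the other."""
    a, b = set(list_a), set(list_b)
    return bool(a & b) and bool(a - b) and bool(b - a)


print(straddles({1, 2}, {2, 3}))     # overlap, neither contains the other
print(straddles({1, 2}, {1, 2, 3}))  # first list is contained in the second
print(straddles({1}, {2}))           # disjoint lists
```

The test is symmetric in its arguments, matching the remark that follows the definition.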

Claim. If location $x$ is marked ON w.r.t. motifs $m_1, m_2, \ldots, m_l$, where no two 
location lists are identical, then every pair of location lists $\mathcal{L}_{m_i}$, $\mathcal{L}_{m_j}$, $i \ne j$, must 
straddle. 

Proof. Assume this does not hold, then $\mathcal{L}_{m_i} \subset \mathcal{L}_{m_j}$, for some $i$ and $j$. In that 
case the location $x$ is marked OFF w.r.t. $m_j$, which is a contradiction. □ 

For each motif $m$, define $c(m)$ to be the charge, which is a positive integer. 
This is initialized to 1 for every motif. In the counting process, when there is 
difficulty in accounting for a motif $m$, a charge or count for $m$ is assigned at 
some other motif $m'$: thus $m'$ would account for itself and all the other 
motifs whose charge it is carrying (thus $m'$ is the banker, as defined in the 
next step, for all these other motifs). For each motif $m$, define $B(m)$ to be the 
banker of $m$, which is a motif $m'$ that is carrying the charge for $m$. For each $m$, 
initialize $B(m) = m$. Every motif is marked LIVE/DEAD. At the initialization 
step every motif that is not redundant (see the first Claim above) is marked LIVE. If there 
exists a position $x$ that is marked ON w.r.t. only one LIVE motif $m$, $m$ is marked 
DEAD. Repeat this process until no more motifs can be marked DEAD. 

In some sense, every DEAD motif at this stage is such that there is a unique 
position ($x$ of the last paragraph) that can be uniquely assigned to it. The number 
of DEAD motifs is at most $n$. 

We begin by introducing some more definitions. 

Definition 12. (Instance) An instance of a realization of a motif $m$ is the motif 
at some location $x \in \mathcal{L}_m$ on the input string $s$. 

For example, let $s = abccdabcd$ and let $m = ab.^{[1,2]}d$. Then one instance of $m$ on 
$s$ is $abccd$ at position 1 and the other instance is $abcd$ at position 6. The solid 
characters in the instances of $m$ are at positions 1, 2 and 5 of $s$ in 
the first and at positions 6, 7 and 9 in the second. 

Definition 13. (i-Connected) An instance of a realization of a motif mi is i- 
connected to an instance of a realization of a motif m2 if the two instances have 
at least i common solid characters. 

Let $\hat{m}_a^x$ be an instance of $m_a$ where $x \in \mathcal{L}_{m_a}$. To avoid clutter we refer to an 
instance of motif $m_a$ simply as $\hat{m}_a$. 

Lemma 2. Consider an instance each of a set of motifs $m_1, m_2, \ldots, m_l$ such 
that for that instance of the realization of $m_i$, the starting position $x \in \mathcal{L}_{m_i}$ is 
marked ON w.r.t. $m_i$, $1 \le i \le l$, and, for every motif $m_i$ there exists a realization 
of motif $m_j$, $j \ne i$, $1 \le i, j \le l$, such that the two instances are 2-connected, 
then there exist distinct positions $j_1, j_2, \ldots, j_l$ on the input string $s$, with the 
corresponding positions $j'_1, j'_2, \ldots, j'_l$, such that $m_1[j'_1], m_2[j'_2], \ldots, m_l[j'_l]$ are 
solid characters. 

Proof. Assume this does not hold, then there exist instances of realizations 
of motifs $m_{j_a}$, $m_{j_b}$, as $\hat{m}_{j_a}$, $\hat{m}_{j_b}$ respectively, $1 \le j_a, j_b \le l$, with $\hat{m}_{j_a} \preceq \hat{m}_{j_b}$. 
Consider $m''_{j_a}$, a realization of the sub-motif of $\hat{m}_{j_a}$ which starts at the starting 
position of $\hat{m}_{j_b}$ and ends at the ending position of $\hat{m}_{j_b}$. If the position w.r.t. $m''_{j_a}$ 
is ON, it is a contradiction since then the position at which $\hat{m}_{j_b}$ is incident must 
be marked OFF. However, if the position w.r.t. $m''_{j_a}$ is OFF, then there exists 
an instance of a realization of $m_{j_c}$, $\hat{m}_{j_c}$, such that $m''_{j_a} \preceq \hat{m}_{j_c}$. But $\hat{m}_{j_b} \preceq \hat{m}_{j_c}$ 
and both are ON, which is again a contradiction. □ 
Next, we define an operational connectedness on ON-marked instances of 
LIVE motifs $m_a$ and $m_b$, called the o-connectedness, which holds if $\hat{m}_a$ is 
2-connected to $\hat{m}_b$, or there exists $\hat{m}_c$, where the instance is ON-marked w.r.t. 
LIVE motif $m_c$, and $\hat{m}_a$ is o-connected to $\hat{m}_c$ and $\hat{m}_c$ is o-connected to $\hat{m}_b$. 

Lemma 3. o-connectedness is an equivalence relation. 

Proof. It can be easily verified that o-connectedness is reflexive, symmetric and 
transitive. Thus all the ON-marked instances of the LIVE motifs can be 
partitioned into equivalence classes. □ 
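
Because o-connectedness is the transitive closure of 2-connectedness, its equivalence classes can be computed with a standard union-find structure once the 2-connected pairs are known. The sketch below is a generic illustration under that assumption; the function name and the labels used for motif instances are hypothetical.

```python
def equivalence_classes(items, related_pairs):
    """Partition items into classes of the transitive closure of related_pairs."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        parent[find(a)] = find(b)  # merge the two classes

    classes = {}
    for x in items:
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())


# Hypothetical instances: m1 and m2 are 2-connected, and so are m2 and m3.
instances = ["m1", "m2", "m3", "m4"]
two_connected = [("m1", "m2"), ("m2", "m3")]
print(equivalence_classes(instances, two_connected))
```

Here m1 and m3 end up in the same class through m2 even though they are not directly 2-connected, which mirrors the recursive clause in the definition of o-connectedness.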

Claim. Using Lemmas 2 and 3, every instance of a realization of a LIVE motif 
$m_a$ has a solid character at position $j_a$ associated with it. Let $D(\hat{m}_a)$ = 

Charging Scheme. We next describe a charging (or counting) scheme by which 
we count the number of motifs. This is best described as an iterative process as 
follows. 

While there exists a position $x$ on $s$ such that $x$ is marked ON w.r.t. LIVE
motifs $m_1, m_2, \ldots, m_l$, $l > 1$, do the following for $1 \le i \le l$:

1. Let $B(m_i) = D(\hat{m}_i)$ (see Step 1.3 and the Claim above).

2. $c(B(m_i)) = c(B(m_i)) + c(m_i)$ (see Step 1.2).

3. Mark $m_i$ DEAD (see Step 1.4)³.

Claim. The loop terminates. 

Proof. At every iteration at least two distinct LIVE motifs are marked DEAD,
hence the loop must terminate. □

Claim. At the end of the loop, all the LIVE motifs are such that for every pair 
mi,mj\ Lrrii and Cmj do not straddle and mj mi, without loss of generality. 

Proof. The first condition holds obviously since otherwise the loop would not 
terminate since x G Lrm C Cmj would be marked ON w.r.t mi and mj. The 
second condition also hold obviously since if m,- -A mi, then motif m, is marked 
DEAD (Step 1). o 

Next, we need to show that the charge $c(m)$ carried by every LIVE motif $m$
at the end of the loop can be accounted for by $\mathcal{L}_m$. In this context we
make the following observation about the charge: the iterative assignment of
charges to a motif has a tree structure. This is best described using an
example: see Figure 1. Each level of the tree corresponds to an iteration in the
loop. For instance, the top level in the left tree denotes that at iteration 1,
$B(a.bxyc..ab) = B(xyxyc..ab) = B(c...dxyc..ab) = xyc..ab$ and
$c(xyc..ab) = 1 + c(a.bxyc..ab) + c(xyxyc..ab) + c(c...dxyc..ab)$. At the end of this iteration
motifs a.bxyc..ab, xyxyc..ab and c...dxyc..ab are marked DEAD. At the second
iteration, $B(xyc..ab) = yc..ab$ and $c(yc..ab) = 1 + c(xyc..ab)$ and motif xyc..ab is
marked DEAD, and so on.
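
The charge bookkeeping in this example can be replayed in a few lines. The sketch below assumes that every motif starts with charge 1 and that the map from each DEAD motif $m$ to $B(m)$ is given and contains no self-charges; the name `accumulate_charges` is illustrative only.

```python
def accumulate_charges(charged_to):
    """charged_to maps each DEAD motif m to B(m), the motif receiving its charge.

    Assumptions of this sketch: every motif starts with charge 1, and the map
    is acyclic (the case B(m) = m, where m stays LIVE, is left out).
    """
    motifs = set(charged_to) | set(charged_to.values())
    charge = {m: 1 for m in motifs}

    def depth(m):
        # Number of charging steps between m and the root of its charge tree.
        d = 0
        while m in charged_to:
            m = charged_to[m]
            d += 1
        return d

    # Handle the motifs farthest from the root first, so that a motif has
    # collected all incoming charge before it passes its own charge on.
    for m in sorted(charged_to, key=depth, reverse=True):
        charge[charged_to[m]] += charge[m]
    return charge
```

Feeding in the first iteration of the left tree (three motifs charging xyc..ab) gives xyc..ab a charge of 4, in agreement with the formula for $c(xyc..ab)$ above.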



³ The only exception is made when $B(m_i) = m_i$. In this case $m_i$ remains LIVE.
Claim. Let $L$ denote the number of leaf nodes (nodes with no incoming edges)
in the charge tree of motif $m$ at the end of the while loop, then $|\mathcal{L}_m| \ge L$.

Proof. Such a claim holds since we know that by our choice of $B(\cdot)$, if
$B(m_1) = B(m_2) = \ldots = B(m_l) = m'$ then by Lemma 2 $m'$ must have $l$ distinct
instances, each instance in a distinct equivalence class of realizations of motif
instances (Lemma 3). However, the instance of $m'$ may not be distinct from each
of these instances; hence the non-leaf nodes may not be accounted for but the
leaf nodes are. Hence $|\mathcal{L}_m| \ge L$. □

At an iteration, if a motif $m$ is charged by more than one motif (or in the
charge tree, the node has more than one incident edge), $m$ is certainly maximal.
However, if it is charged by exactly one motif then it may or may not be maximal;
if it is maximal, it must have an extra instance. We use the following folklore
lemma to bound the size of $I$, the number of non-leaf nodes in the charge tree.

Lemma 4. Given a tree $T$, where each node, except the leaf nodes, must have at
least two children, the number of non-leaf nodes, $I$, is no more than the number
of leaf nodes, $L$.
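
The lemma follows from counting edges: a tree with $I + L$ nodes has $I + L - 1$ edges, and if every non-leaf node has at least two children there are at least $2I$ edges, so $I \le L - 1$. The sketch below simply counts both kinds of nodes for a tree given as a child-list dictionary; the representation and the name `count_nodes` are assumptions made for illustration.

```python
def count_nodes(children, root):
    """Return (non_leaf, leaf) counts for a tree given as node -> list of children."""
    non_leaf = leaf = 0
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            non_leaf += 1
            stack.extend(kids)
        else:
            leaf += 1
    return non_leaf, leaf
```

For a root with two children, one of which has three children of its own, this yields 2 non-leaf nodes and 4 leaves.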

We are not interested in counting non-maximal motifs, and these are the only
motifs that contribute a single child to a node in a tree. Thus the number of
maximal motifs that were marked LIVE at the start of Step 2 is no more than
$2n$, using the preceding Claim and Lemma 4.



[Figure 1: two charge trees. Left tree: a.bxyc..ab, xyxyc..ab and c...dxyc..ab, each
with c() = 1, charge xyc..ab, which ends with c() = 4. Right tree: the chain
xyxyc..ab (c() = 1), xyc..ab (c() = 2), yc..ab (c() = 3), c..ab (c() = 4), ab (c() = 5).]

Fig. 1. Two examples showing the different steps in the assignment of charge to motif
ab: Every level of the tree corresponds to an iteration in the while loop. The dashed edge
indicates that the motif at the “to” end of the edge could possibly be non-maximal.
Using Step 1.4, we have that the number of maximal and non-redundant motifs is
$\le (n + 2n) = 3n$. This concludes the proof of the theorem. □

Corollary 1. Given a string s, the basis B is unique. 

Assume the contrary that there exist two distinct bases, $B$ and $B'$. Without
loss of generality, let $m \in B$ and $m \notin B'$. Then
$\mathcal{L}_m = \mathcal{L}_{m_1} \cup \mathcal{L}_{m_2} \cup \ldots \cup \mathcal{L}_{m_p}$,
$m_i \in B'$, $1 \le i \le p$. Now, $\mathcal{L}_{m_i} =$ [...] Hence $m \notin B$, which is
a contradiction. □

Corollary 2. Given a string $s$, let $\mathcal{M}$ be the set of all motifs. Let $\mathcal{M}' \subset \mathcal{M}$
be an arbitrary set of maximal motifs. Then the basis $B'$ of $\mathcal{M}'$ is such that
$|B'| < 3n$.

This follows immediately since the proof of the theorem does not use the fact
that $\mathcal{M}$ is the set of all motifs of $s$. It simply works on the location lists (sets)
of this special set of maximal motifs. □

Algorithm to Detect the Irredundant Motifs. The algorithm is exactly along
the lines of the one presented in [Par99, PRF+00] for the rigid motifs with the
following exceptions. At the very first step all the rigid motifs with exactly
two solid characters are constructed. The rigid motifs are consolidated to form
flexible motifs. For example, motifs a.b, a..b, a...b are consolidated to form
$a.^{[1,3]}b$. Also $\mathcal{L}_{a.^{[1,3]}b}$ is set to $\mathcal{L}_{a.b} \cup \mathcal{L}_{a..b} \cup \mathcal{L}_{a...b}$. Also, each location is annotated for
multiple occurrences, i.e. all the distinct end locations of the realization of the
motifs are stored at that location. For example, if $x \in \mathcal{L}_{a.b}, \mathcal{L}_{a..b}$, then $x$ is
annotated to note that the motif occurs twice at that location, ending in two
different positions. When a pair of motifs is considered for concatenation, these
multiple occurrences are taken into account.
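
The consolidation step can be sketched for the two-solid-character case as follows. This is an illustration under stated assumptions, not the paper's implementation: positions are 0-based, a location list is a set of start positions, and the helper names `rigid_locations` and `consolidate` are invented here.

```python
def rigid_locations(s, first, last, gap):
    """Start positions of the rigid motif: first, then `gap` dot characters, then last."""
    span = gap + 1
    return {i for i in range(len(s) - span)
            if s[i] == first and s[i + span] == last}


def consolidate(s, first, last, gaps):
    """Location list of the flexible motif, annotated with every end position.

    The keys are the union of the rigid location lists; the value stored at a
    start position lists all distinct end positions of realizations found there.
    """
    annotated = {}
    for gap in gaps:
        for start in rigid_locations(s, first, last, gap):
            annotated.setdefault(start, []).append(start + gap + 1)
    return {start: sorted(ends) for start, ends in annotated.items()}
```

On the string axbbxb, the motifs a.b and a..b both occur at position 0 while a...b does not, so position 0 is annotated with the two end positions 2 and 3.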

Lemma 5. The algorithm takes O(n^logn) time. 

Proof. At each iteration there are only 0(n) motifs after the pruning. Due to 
the multiple occurrences, there are 0{n^) comparisons made at each iteration. 
Since the location list can be no more that n each with at most n occurrences, 
the amount of work done is 0{n^I), where I is the number of iterations. But 
I = logL where L is the maximum number of solid characters in a motif, since 
the motif grows by concatenating two smaller motifs. But L < n, hence the 
result. A 

Generalizations to Sets of Characters and Real Numbers. All of the discussion
above can be easily generalized to dealing with input on sets of characters
and even real numbers. In the latter case, two real numbers $r_1$ and $r_2$ are
deemed equal if for a specified $\delta$, $|r_1 - r_2| < \delta$. Due to space constraints, we do
not elaborate on this further; the details of handling rigid patterns appear in
[Par99, PRF+00].
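
As a small illustration of the tolerance test for real-valued input, the following sketch implements the comparison; the function name `deemed_equal` is hypothetical. Note that such a relation is in general not transitive.

```python
def deemed_equal(r1, r2, delta):
    """Two real numbers are deemed equal if they differ by less than delta."""
    return abs(r1 - r2) < delta
```

With delta = 0.5, the values 1.0 and 1.4 are deemed equal, as are 1.4 and 1.8, but 1.0 and 1.8 are not.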
4 Conclusions

We have presented a uniform framework to study the problem of pattern
discovery. We have separated the application from the task of discovering patterns
in order to better understand the space of all patterns on a given string. The
framework treats rigid and flexible motifs uniformly, including strings of sets of
characters and even real numbers. Through appropriate definitions of maximality
and irredundancy we show that there exists a small set of crucial motifs called
the basis. Since a polynomial time algorithm has been presented to detect the
basis, this has algorithmic implications as well. It is possible that this basis would
be critical in designing efficient algorithms to detect all or significant motifs,
depending on the application.
Abstract. Ambiguity in dynamic programming arises from two independent
sources, the non-uniqueness of optimal solutions and the particular
recursion scheme by which the search space is evaluated. Ambiguity,
unless explicitly considered, leads to unnecessarily complicated, inflex- 
ible, and sometimes even incorrect dynamic programming algorithms. 
Building upon the recently developed algebraic approach to dynamic 
programming, we formalize the notions of ambiguity and canonicity. We 
argue that the use of canonical yield grammars leads to transparent and 
versatile dynamic programming algorithms. They provide a master copy 
of recurrences, that can solve all DP problems in a well-defined domain. 
We demonstrate the advantages of such a systematic approach using 
problems from the areas of RNA folding and pairwise sequence compar- 
ison. 



1 Motivation and Overview 

1.1 Ambiguity Issues in Dynamic Programming

Dynamic Programming (DP) solves combinatorial optimization problems. It is 
a classical programming technique throughout computer science [3], and plays 
a dominant role in computational biology [4, 10]. A typical DP problem spawns 
a search space of potential solutions in a recursive fashion, from which the final 
answer is selected according to some criterion of optimality. If an optimal solution 
can be derived recursively from optimal solutions of subproblems [1], DP can 
evaluate a search space of exponential size in polynomial time and space. 

Sources of Ambiguity. By ambiguity in dynamic programming we refer to the 
following facts which complicate the understanding and use of DP algorithms: 

— Co-optimal and near-optimal solutions: It is well known that the “optimal” 
solution found by a DP algorithm normally is not unique, and there may be 
relevant near-optimal solutions. A single, “optimal” answer is often unsat- 
isfactory. Considerable work has been devoted to this problem, producing 
algorithms providing near-optimal [15,17] and parametric [11] solutions. 

— Duplicate solutions: While there is a general technique to enumerate all 
solutions to a DP problem (possibly up to some threshold value) [21,22], 
such enumeration is hampered by the fact that the algorithm may produce
the same solution several times - and in fact, this may lead to combinatorial 
explosion of redundancy. Heuristic enumeration techniques, and post-facto 
filtering as a safeguard against duplicate answers are employed e.g. in [23]. 

— (Non-) canonical solutions: Often, the search space exhibits additional redun- 
dancy in terms of solutions that are represented differently, but are equiva- 
lent from a more semantic point of view. Canonization is important in eval- 
uating statistical significance [14], and also in reducing redundancy among 
near-optimal solutions. 



Ambiguity Examples. Strings aaaccttaa and aaaggttaa are aligned below.
Alignments (1) and (2) are equivalent under most scoring schemes, while (3)
may even be considered a malformed alignment, as it shows two deletions
separated by an insertion.

    aaacc--ttaa    aaa--ccttaa    aaac--cttaa
    aaa--ggttaa    aaagg--ttaa    aaa-gg-ttaa
        (1)            (2)            (3)
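
The equivalence of these alignments under a column-wise scoring scheme can be checked directly. The sketch below uses illustrative scores (match +1, mismatch -1, gap column -2) that are assumptions of this example and not taken from the text; `gap_openings` counts maximal runs of gap symbols, which is what an affine gap model would additionally penalize.

```python
import re


def score(top, bottom, match=1, mismatch=-1, gap=-2):
    """Column-wise alignment score with a linear gap penalty (illustrative values)."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def gap_openings(top, bottom):
    """Number of maximal runs of gap symbols over both rows."""
    return len(re.findall(r"-+", top)) + len(re.findall(r"-+", bottom))


ALIGNMENTS = [
    ("aaacc--ttaa", "aaa--ggttaa"),  # (1)
    ("aaa--ccttaa", "aaagg--ttaa"),  # (2)
    ("aaac--cttaa", "aaa-gg-ttaa"),  # (3)
]
```

All three alignments contain seven matching columns and four gap columns, so they receive the same linear score; only (3) has a third gap opening.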

In the RNA folding domain, each DP algorithm seems to be a one-trick pony. 
Different recurrences have been developed for counting or estimating the num- 
ber of various classes of feasible structures of a sequence of given length [12], 
for structure enumeration [22], energy minimization [25], and base pair maxi- 
mization [19]. Again, enumerating co-optimal answers will produce duplicates in 
the latter two cases. In [4](p. 272) a probabilistic scoring scheme is suggested 
to find the most likely RNA secondary structure - this is a valid idea, but will 
work correctly only if the underlying recursion scheme considers each feasible 
structure exactly once. 



Our Main Contributions. The recently developed technique of algebraic dynamic 
programming (ADP), summarized in Section 2, uses yield grammars and eval- 
uation algebras to specify DP algorithms on a rather high level of abstraction. 
DP algorithms formulated this way can have all the ambiguity problems illus- 
trated above. However, the ADP framework also helps to analyse and avoid these 
problems. 

1. In this article, we devise a formal framework to explain and reason about 
ambiguity in its various forms. 

2. Introducing canonical yield grammars, we show how to construct a “master 
copy” of a DP algorithm for a given problem class. This single set of recur- 
rences can correctly and efficiently perform all analyses in this problem class, 
including optimization, complete enumeration, sampling and statistics. 

3. Re-use of the master recurrences for manifold analyses provides a major 
advantage from a software-engineering point of view, as it enhances not only 
programming economy, but also program reliability. 
2 A Short Review of Algebraic Dynamic Programming

ADP introduces a conceptual splitting of a DP algorithm into a recognition 
and an evaluation phase. A yield grammar is used to specify the recognition 
phase (i. e. the search space of the optimization problem). A particular parsing 
technique turns the grammar directly into an efficient dynamic programming 
scheme. The evaluation phase is specified by an evaluation algebra, and each 
grammar can be combined with a variety of algebras to solve different problems 
over the same data domain, for which heretofore DP recurrences had to be 
developed independently. 

2.1 Basic Notions 

Let A be an alphabet. $A^*$ denotes the set of finite strings over A, and ++ denotes
string concatenation. Throughout this article, $x, y \in A^*$ denote input strings to
various problems, and $|x| = n$. A subword is indicated by its boundaries:
$x_{(i,j)}$ denotes $x_{i+1} \ldots x_j$.

An algebra is a set of values and a family of functions over this set. We shall
allow that these functions take additional arguments from $A^*$. An algebraic data
type T is a type name and a family of typed function symbols, also called
operators. It introduces a language of (well-typed) formulas, called the term algebra.
An algebra that provides a function for each operator in T is a T-algebra. The
interpretation $t_{\mathcal{I}}$ of a term t in a T-algebra $\mathcal{I}$ is obtained by substituting the
corresponding function of the algebra for each operator. Thus, $t_{\mathcal{I}}$ evaluates to a
value in the base set of $\mathcal{I}$.

Terms as syntactic objects can be equivalently seen as trees, where each
operator is a node, which has its subterms as subtrees. Tree grammars over T
describe specific subsets of the term algebra. A regular tree grammar over T [2, 7]
has a set of nonterminal symbols, a designated axiom symbol, and productions
of the form $A \to t$ where A is a nonterminal symbol, and t is a tree pattern,
i.e. a tree over T which may have nonterminals in leaf positions.



2.2 ADP: The Declarative Level

In the sequel, we assume that T is some fixed data type, and A a fixed alphabet.
As a running example, let A = {a, c, g, u}, representing the four bases in RNA,
and let T consist of the operators sr, hl, bl, br, il, representing structural
elements in RNA: stacking regions, hairpin loops, bulges on the left and right
side, and internal loops. Feasible base pairs are a-u, g-c, g-u. The little hairpin
denoted by the term

    s = sr 'c' (sr 'c' (bl "aaa" (hl 'a' "gugu" 'u')) 'g') 'g'

is one of many possible 2D structures of the RNA sequence ccaaaaguguugg.

[Figures: a simple hairpin structure, and the same hairpin in term representation.]
Definition 1 (Evaluation Algebra). An evaluation algebra is a T-algebra
augmented by a choice function h. If $l$ is a list of values of the algebra's value set,
then $h(l)$ is a sublist thereof. We require h to be polynomial in $|l|$. Furthermore,
h is called reductive if $|h(l)|$ is bounded by a constant.

The choice function is a standard part of our evaluation algebras because 
we shall deal with optimization problems. Typically, h will be minimization or 
maximization, but, as we shall see, it may also be used for counting, estimation, 
or some other kind of synopsis. Non-reductive choice functions are used e.g. for 
complete enumeration. The hairpin s evaluates to 3 in the basepair algebra, and 
(naturally) to 1 in the counting algebra: 

    basepair_alg = (sr,hl,bl,br,il,h)      counting_alg = (sr,hl,bl,br,il,h)
      where sr _ x _ = x+1                   where sr _ x _ = x
            hl _ x _ = 1                           hl _ x _ = 1
            bl _ x   = x                           bl _ x   = x
            br x _   = x                           br x _   = x
            il _ x _ = x                           il _ x _ = x
            h        = maximum                     h        = sum
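
To make the two algebras concrete, the sketch below interprets the hairpin term s in each of them. The encoding is an assumption of this example: terms are nested Python tuples, an algebra is a dictionary of functions, and the choice functions return lists so that $h(l)$ is a sublist as required by Definition 1.

```python
# The hairpin s from the text, written as (operator, argument, ...) tuples.
S = ("sr", "c", ("sr", "c", ("bl", "aaa", ("hl", "a", "gugu", "u")), "g"), "g")

BASEPAIR_ALG = {
    "sr": lambda _a, x, _b: x + 1,
    "hl": lambda _a, _region, _b: 1,
    "bl": lambda _region, x: x,
    "br": lambda x, _region: x,
    "il": lambda _r1, x, _r2: x,
    "h": lambda values: [max(values)] if values else [],
}

COUNTING_ALG = dict(BASEPAIR_ALG)
COUNTING_ALG["sr"] = lambda _a, x, _b: x
COUNTING_ALG["h"] = lambda values: [sum(values)]


def evaluate(term, alg):
    """Interpret a term by replacing each operator with the algebra's function."""
    if isinstance(term, str):  # terminal argument (a base or a region)
        return term
    op, *args = term
    return alg[op](*(evaluate(a, alg) for a in args))
```

Evaluating S gives 3 in the basepair algebra (one hairpin loop plus two stacked pairs) and 1 in the counting algebra, matching the values stated above.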



Definition 2 (Yield Grammar). A yield grammar $(\mathcal{G}, y)$ is given by

- an underlying algebraic datatype T, and alphabet A,

- a homomorphism $y: T \to A^*$ called the yield function,

- a regular tree grammar $\mathcal{G}$ over T.

$\mathcal{L}(\mathcal{G})$ denotes the tree language derived from the axiom, and
$y(\mathcal{G}) := \{y(t) \mid t \in \mathcal{L}(\mathcal{G})\}$ is the yield language of $\mathcal{G}$.

The homomorphism condition means that $y(C\, x_1 \ldots x_n) = y(x_1) ++ \ldots ++ y(x_n)$ for
any operator C of T. For the hairpin s, we have y(s) = ccaaaaguguugg. By
virtue of the homomorphism property, we may apply the yield function to the
righthand sides of the productions in the tree grammar. In this way, we obtain
a context free grammar $y(\mathcal{G})$ such that $y(\mathcal{G}) = \mathcal{L}(y(\mathcal{G}))$.
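
The homomorphism condition translates into a short recursive function. The sketch below assumes terms are encoded as nested tuples of the form (operator, argument, ...), with plain strings as terminal arguments; this encoding and the name `yield_of` are illustrative choices.

```python
def yield_of(term):
    """Yield function y: concatenate the yields of all arguments, left to right."""
    if isinstance(term, str):  # a base or a region is its own yield
        return term
    _operator, *args = term
    return "".join(yield_of(a) for a in args)


# The hairpin s from the text.
S = ("sr", "c", ("sr", "c", ("bl", "aaa", ("hl", "a", "gugu", "u")), "g"), "g")
```

Applied to S, the function returns the sequence ccaaaaguguugg stated above.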

Definition 3 (Yield Parsing). The yield parsing problem of $(\mathcal{G}, y)$ is to
compute for a given $s \in A^*$ the set of all $t \in \mathcal{L}(\mathcal{G})$ such that
$y(t) = s$.

Definition 4 (Algebraic Dynamic Programming). Let $\mathcal{I}$ be a T-algebra
with a reductive choice function $h_{\mathcal{I}}$. Algebraic Dynamic Programming is
computing for given $s \in A^*$ the set of solutions
$h_{\mathcal{I}}\{t_{\mathcal{I}} \mid y(t) = s\}$ in polynomial time and space.

This definition precisely describes a class of DP problems over sequences. All 
biosequence analysis problems we have studied so far fall under the ADP frame- 
work. Outside the realm of sequences, DP is also done over trees, dags, and 
graphs. It is open whether the concept of a yield grammar can be generalized 
to accommodate these domains. A detailed discussion of the scope of ADP is 
beyond the space limits of this short review. 
2.3 ADP: The Notation

Once a problem has been specified by a yield grammar and an evaluation algebra, 
the ADP approach provides a systematic transition to an efficient DP algorithm 
that solves the problem. To achieve this, we introduce a notation for yield gram- 
mars that is both human readable and — executable! In ADP notation, yield 
grammars are written in the form 

    hairpin = axiom struct where
      struct = open ||| closed
      open   = bl <<< region ~~~ closed |||
               br <<< closed ~~~ region |||
               il <<< region ~~~ closed ~~~ region

The grammar hairpin has axiom struct and further nonterminal symbols
open and closed. Here, base denotes an arbitrary base, and region a nonempty
sequence of bases from the RNA alphabet. The grammar notation is refined further
by allowing predicates and the choice function to be associated with nonterminal
symbols and productions:

    closed =
      ((hl <<< base ~~~ (region `with` minsize 3) ~~~ base |||
        sr <<< base ~~~ (closed ||| open) ~~~ base) `with` basepair) ... h

This production uses two predicates: minsize k requires a yield of minimal
length k, and basepair applies to both alternatives, requiring that the bounding
bases of either closed structure form a feasible base pair. The choice function
h is attached via the ...-combinator, indicating that from several alternative
closed structures, a selection according to h is imposed.

In the syntactic view of yield grammars, we interpret the operators hl,
sr, ... in the term algebra. They merely construct terms or trees representing
hairpins. In this view, the choice function h has little use and should be
assumed to be the identity function. However, in a more semantic view, we see
hl, sr, ... as functions of some evaluation algebra. Then, the “trees” generated
by the grammar are actually formulas that can be evaluated. In this view,
the grammar is a mechanism to generate a set of values, and it makes sense to
apply the algebra's choice function to select (say) a maximal one.

2.4 ADP: The Implementation Level

We now solve the yield parsing problem. A nondeterministic, top-down parser
for a context-free grammar is easily obtained by the combinator technique of
[13]. This idea is adapted to yield grammars. A yield parser $p_N$ for nonterminal
N takes a subword (i, j) of x as its argument and returns the set
$p_N(i,j) = \{t \mid y(t) = x_{(i,j)}\}$. Technically, it returns a list; when the list is empty, we say
that the parser fails. Where the operators of T take strings from $A^*$ as their
arguments, suitable parsers must be provided.

The grammar itself is turned into a parser by defining the combinators as
higher-order functions which compose complex parsers from simpler ones. For
the sake of completeness, definitions are given here, but space does not allow
a thorough discussion. We use list comprehension notation borrowed from the
functional programming language Haskell.



    (r ||| q) (i,j)      = r(i,j) ++ q(i,j)
    (f <<< q) (i,j)      = [f z | z <- q(i,j)]
    (r ~~~ q) (i,j)      = [f y | k <- [i+1..j-1], f <- r(i,k), y <- q(k,j)]
    (r ... h) (i,j)      = h(r(i,j))
    axiom q              = q(0,n)
    (r `with` w) (i,j)   = if w(i,j) then r(i,j) else []
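
For readers less familiar with Haskell, the combinators can be transliterated into Python closures. This is a sketch under assumptions: the Python names (`alt`, `app`, `seq`, `choice`, `with_pred`) are invented here, operators are curried by hand, and base pairs are accepted in both orientations.

```python
def alt(r, q):            # r ||| q
    return lambda i, j: r(i, j) + q(i, j)


def app(f, q):            # f <<< q
    return lambda i, j: [f(z) for z in q(i, j)]


def seq(r, q):            # r ~~~ q: split the subword at every inner position k
    return lambda i, j: [f(y) for k in range(i + 1, j)
                         for f in r(i, k) for y in q(k, j)]


def choice(r, h):         # r ... h
    return lambda i, j: h(r(i, j))


def with_pred(r, w):      # r `with` w
    return lambda i, j: r(i, j) if w(i, j) else []


def axiom(q, n):
    return q(0, n)


def hairpin_loops(x):
    """Parse x as a single hairpin loop: hl <<< base ~~~ region ~~~ base."""
    base = lambda i, j: [x[i]] if j == i + 1 else []
    region = lambda i, j: [x[i:j]] if j > i else []
    pairs = {"au", "ua", "gc", "cg", "gu", "ug"}
    basepair = lambda i, j: j - i >= 2 and x[i] + x[j - 1] in pairs
    hl = lambda a: lambda r: lambda b: ("hl", a, r, b)  # curried term constructor
    loop = with_pred(seq(seq(app(hl, base), region), base), basepair)
    return axiom(loop, len(x))
```

Parsing acccu yields the single term ("hl", "a", "ccc", "u"), while acccg yields the empty list because a and g do not form a feasible pair.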



Note that the axiom- and the with-clause are also defined as functions applied 
to parsers. With these definitions, a grammar like hairpin is now an executable 
yield parser, albeit of miserable efficiency: There may be an exponential number 
of parses, and any subparse is constructed many times. This is alleviated by tab- 
ulating the parser functions. Let p be a table indexing function and tabulated 
be a tabulation function such that 

p (tabulated f) (i,j) = f(i,j), or equivalently 
p (tabulated f) = f 

With this convention, a grammar may be annotated for efficiency, replacing 
parsers by tables. Choosing to tabulate the parser for nonterminal closed, gram- 
mar hairpin now reads 

    hairpin = axiom struct where
      struct = open ||| p closed
      open   = bl <<< region ~~~ p closed |||
               br <<< p closed ~~~ region |||
               il <<< region ~~~ p closed ~~~ region
      closed = tabulated (
               ((hl <<< base ~~~ (region `with` minsize 3) ~~~ base |||
                 sr <<< base ~~~ (p closed ||| open) ~~~ base) `with` basepair) ... h)

Such annotation does not affect the meaning of the grammar, nor that of the 
parser. It only affects the parser’s efficiency: The parser now uses dynamic pro- 
gramming. In general, the parser consists of a family of recursively defined tables 
and functions. Substituting the definitions of the combinators and the functions 
of a specific evaluation algebra, the annotated grammar simplifies to a set of 
recurrences as we traditionally see it in dynamic programming. 
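
The effect of tabulation can be sketched in Python by memoizing the parser for closed. Everything below is an illustration under assumptions: the combinator names are invented, `tabulated` fills its table on demand rather than building an array, base pairs are accepted in both orientations, and a choice is also applied at the axiom (the grammar above attaches h only to closed).

```python
def alt(r, q): return lambda i, j: r(i, j) + q(i, j)            # |||
def app(f, q): return lambda i, j: [f(z) for z in q(i, j)]      # <<<
def seq(r, q):                                                  # ~~~
    return lambda i, j: [f(y) for k in range(i + 1, j)
                         for f in r(i, k) for y in q(k, j)]
def choice(r, h): return lambda i, j: h(r(i, j))                # ...
def with_pred(r, w): return lambda i, j: r(i, j) if w(i, j) else []


def tabulated(parser):
    """Memoize a parser: each subword (i, j) is parsed at most once."""
    table = {}
    def lookup(i, j):
        if (i, j) not in table:
            table[(i, j)] = parser(i, j)
        return table[(i, j)]
    return lookup


def hairpin(alg, x):
    sr, hl, bl, br, il, h = alg
    pairs = {"au", "ua", "gc", "cg", "gu", "ug"}
    base = lambda i, j: [x[i]] if j == i + 1 else []
    region = lambda i, j: [x[i:j]] if j > i else []
    basepair = lambda i, j: j - i >= 2 and x[i] + x[j - 1] in pairs
    minsize = lambda k: lambda i, j: j - i >= k

    # Recursive references are wrapped in lambdas because Python is eager.
    closed_ref = lambda i, j: closed(i, j)
    open_ref = lambda i, j: open_(i, j)

    open_ = alt(seq(app(bl, region), closed_ref),
                alt(seq(app(br, closed_ref), region),
                    seq(seq(app(il, region), closed_ref), region)))
    closed = tabulated(choice(with_pred(
        alt(seq(seq(app(hl, base), with_pred(region, minsize(3))), base),
            seq(seq(app(sr, base), alt(closed_ref, open_ref)), base)),
        basepair), h))
    struct = choice(alt(open_, closed), h)  # top-level choice added for this sketch
    return struct(0, len(x))


# Basepair algebra with curried operators; h keeps only the maximum.
BASEPAIR_ALG = (
    lambda a: lambda v: lambda b: v + 1,    # sr
    lambda a: lambda r: lambda b: 1,        # hl
    lambda r: lambda v: v,                  # bl
    lambda v: lambda r: v,                  # br
    lambda r1: lambda v: lambda r2: v,      # il
    lambda values: [max(values)] if values else [],
)
```

On the running example ccaaaaguguugg the optimum is 4 base pairs (c-g, c-g, a-u, a-u around the loop aagug), one more than the particular hairpin s discussed earlier, which is only one candidate in the search space.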



2.5 Two Classical DP Algorithms in ADP Notation 

Zuker’s Algorithm for RNA Folding. Zuker and Stiegler [25] gave a DP algorithm 
for determining the minimal free energy structure of an RNA molecule under the 
nearest neighbour model. The model and the algorithm have been elaborated 
considerably since then, but for lack of space, we base our discussion on the 
original description. Evers [5] has recently reformulated Zuker's recurrences as a
yield grammar $\mathcal{G}_{zukerS1}$¹:

¹ This example shows actually executable ADP code, and contains a few refinements
not explained in Section 2. The variants of the ~~~-operator are all equivalent in
    zukerS1 algebra inp = axiom struct where
      (str,hl,bi,sr,bl,br,il,ol,or,co,h) = algebra
      -- nonterminals v and w are Zuker's tables V and W.
      struct      = str <<< p w
      v           = tabulated (
                    ((hairpin ||| twoedged ||| bifurcation) `with` basepair) ... h)
      hairpin     = hl <<< base ~~~ (region `with` minsize 3) ~~- base
      bifurcation = bi <<< base ~~~ p w ~~~ p w ~~~ base ... h
      twoedged    = stack ||| bulgeleft ||| bulgeright ||| interior ... h
      stack       = sr <<< base ~~~ p v ~~~ base
      bulgeleft   = bl <<< base ~~~ region ~~~ p v ~~~ base
      bulgeright  = br <<< base ~~~ p v ~~~ region ~~~ base
      interior    = il <<< base ~~~ region ~~~ p v ~~~ region ~~~ base

      w           = tabulated (openleft ||| openright ||| p v ||| connected ... h)
      openleft    = ol <<< base ~~~ p w
      openright   = or <<< p w ~~~ base
      connected   = co <<< p w ~~~ p w ... h

This grammar uses two essential nonterminals, v and w; the others are in- 
troduced to reflect Zuker’s case analysis. It is quite instructive to reformulate 
classical DP algorithms in the uniform ADP framework. Making explicit the 
grammar behind the algorithm helps to clarify properties relating to ambiguity 
as well as efficiency. 



The Needleman-Wunsch Algorithm of 1970. The Needleman-Wunsch algorithm
for pairwise sequence comparison [18] is based on a particularly simple yield
grammar with a single nonterminal symbol alignment, terminals xbase, ybase,
region, empty, and the algebra represented by the five operators replace,
delete, insert, nil, h. When sequences x and y are to be aligned, the input
to this parser is $x ++ y^{-1}$.



    nw_alignment algebra x y = axiom alignment where
      (replace, delete, insert, nil, h) = algebra
      alignment = tabulated (
                  replace <<< xbase ~~~ p alignment ~~~ ybase |||
                  delete  <<< region ~~~ p alignment          |||
                  insert  <<< p alignment ~~~ region          |||
                  nil     ><< empty
                  ... h)
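
Substituting a concrete algebra into this grammar gives ordinary recurrences over subwords of x followed by the reversed y. The sketch below is one such instance under assumptions made for illustration: a unit-cost distance algebra (replace costs 0 or 1, a deleted or inserted region costs its length, h is the minimum) rather than the original similarity scores, with `functools.lru_cache` playing the role of the table.

```python
from functools import lru_cache


def nw_distance(x, y):
    """Unit-cost alignment distance via the recurrences of the alignment grammar."""
    z = x + y[::-1]           # parser input: x followed by y reversed
    m, n = len(x), len(z)     # m is the boundary between the two sequences

    @lru_cache(maxsize=None)
    def alignment(i, j):
        # Best value for the subword z[i:j]; it always contains the boundary m.
        candidates = []
        if i == j:                              # nil ><< empty
            candidates.append(0)
        if i < m < j:                           # replace: xbase on the left, ybase on the right
            cost = 0 if z[i] == z[j - 1] else 1
            candidates.append(cost + alignment(i + 1, j - 1))
        for k in range(i + 1, m + 1):           # delete: a region of x on the left
            candidates.append((k - i) + alignment(k, j))
        for k in range(m, j):                   # insert: a region of y on the right
            candidates.append((j - k) + alignment(i, k))
        return min(candidates)                  # choice function h = minimum

    return alignment(0, n)
```

Because regions of any length may be deleted or inserted, the same gap can be derived in several ways; this does not affect the minimum, but it is exactly the kind of duplication discussed in Section 3.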



3 Ambiguity and Canonicity 

3.1 Formalizing Ambiguity and Canonicity 

Remember that a context-free grammar $\mathcal{G}$ is ambiguous, if there are different
leftmost derivations for some $x \in \mathcal{L}(\mathcal{G})$.



the declarative view, but operationally they are special cases with a more efficient
implementation. E.g., ~~- is used when the righthand parser accepts a single base.
Definition 5 (Yield Grammar Ambiguity). A tree grammar $\mathcal{G}$ is ambiguous
if there are different leftmost derivations for some tree $t \in \mathcal{L}(\mathcal{G})$. A yield
grammar $(\mathcal{G}, y)$ is ambiguous, if $\mathcal{G}$ is ambiguous, otherwise it is unambiguous.
A yield grammar $(\mathcal{G}, y)$ is strictly unambiguous, if it is unambiguous and y is
injective.

Strict unambiguity means that for each $s \in A^*$, we have at most one $t \in \mathcal{L}(\mathcal{G})$
such that $y(t) = s$. Hence, we do not have an optimization problem at all. Strictly
unambiguous yield grammars play no part in dynamic programming.

Canonicity means that all solutions from which we want to choose an optimal 
one have a unique representation in the search space. For example, alignments 
as shown in Sect. 1.1 could be canonized by requiring that deletions are arranged 
always before adjacent insertions. To formalize canonicity, we must introduce a 
canonical model as the point of reference. 

Definition 6 (Canonical Models and Canonical Yield Grammars). Let
$\mathcal{K}$ be a set, the canonical model. Let k be a mapping from $\mathcal{L}(\mathcal{G})$ to $\mathcal{K}$. A yield
grammar $(\mathcal{G}, y)$ is canonical w.r.t. $\mathcal{K}$ and k if it is unambiguous and the mapping
k is bijective. A DP algorithm is canonical w.r.t. $\mathcal{K}$ and k, if the underlying yield
grammar is canonical w.r.t. $\mathcal{K}$ and k.

The canonical model may exist merely in the mind of the algorithm designer, 
but preferably, it should be formulated explicitly, together with the mapping k. 
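
For the alignment example of Sect. 1.1, the canonization rule "deletions before adjacent insertions" can be written as a mapping into a canonical model. The sketch below assumes alignments are given as two gapped rows of equal length; the function name `canonize` is illustrative.

```python
def canonize(top, bottom):
    """Reorder every run of gap columns so that deletions precede insertions."""
    out, run = [], []

    def flush():
        # Deletions have the gap in the bottom row, insertions in the top row.
        out.extend(col for col in run if col[1] == "-")
        out.extend(col for col in run if col[0] == "-")
        run.clear()

    for col in zip(top, bottom):
        if "-" in col:
            run.append(col)
        else:
            flush()
            out.append(col)
    flush()
    return "".join(a for a, _ in out), "".join(b for _, b in out)
```

All three alignments from Sect. 1.1 are mapped to alignment (1), so a search space that contains them separately is not canonical with respect to this model.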

3.2 Analysing Canonicity 

We show that the Zuker algorithm is not canonical. A canonical model for RNA
secondary structures would be sets of properly nested base pairs. Such a model
is too remote from the tree-like representation of RNA structures. The Vienna
notation, encoding a structure as a string of dots and properly nested
parentheses, however, proves to be very convenient. It can be formally defined as $\mathcal{L}(V)$,
using the string grammar $V = \{R \to . \mid .. \mid S,\ S \to ... \mid .S \mid S. \mid SS \mid (S)\}$. Our little
hairpin s would be denoted by the pair ("((...(....)))", "ccaaaaguguugg").

The mapping k from Zuker's underlying data type Z to $\mathcal{L}(V)$ is defined via

    k(bi(a,u,v,b)) = "(" ++ k(u) ++ k(v) ++ ")"
    k(ol(a,v))     = "." ++ k(v)
    k(co(u,v))     = k(u) ++ k(v)
    k(or(u,b))     = k(u) ++ "."

Further equations are omitted, as these suffice to prove the equalities below. 

Theorem 1. The Zuker DP algorithm for RNA folding is not canonical with 
respect to feasible RNA structures. 
Proof. We observe the equalities

    k(ol(a, or(w, b)))       = k(or(ol(a, w), b))          (1)
    k(co(u, co(v, w)))       = k(co(co(u, v), w))          (2)
    k(ol(u, co(v, w)))       = k(co(ol(u, v), w))          (3)
    k(bi(a, u, co(v, w), b)) = k(bi(a, co(u, v), w, b))    (4)
    k(bi(a, or(u, b), w, c)) = k(bi(a, u, ol(b, w), c))    (5)

Either one of these proves that k is not injective.
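
The five equalities can be checked mechanically from the four equations given for k. In the sketch below, terms are nested tuples and a hypothetical operator "leaf", carrying a ready-made Vienna string, stands in for the substructures whose equations are omitted in the text.

```python
def k(term):
    """Mapping k into Vienna notation, restricted to the equations given in the text."""
    op, *args = term
    if op == "leaf":   # hypothetical stand-in for an arbitrary substructure
        return args[0]
    if op == "bi":
        _a, u, v, _b = args
        return "(" + k(u) + k(v) + ")"
    if op == "ol":
        _a, v = args
        return "." + k(v)
    if op == "or":
        u, _b = args
        return k(u) + "."
    if op == "co":
        u, v = args
        return k(u) + k(v)
    raise ValueError(f"no equation given for operator {op!r}")


U, V, W = ("leaf", "(...)"), ("leaf", "(....)"), ("leaf", "(.....)")
```

For instance, both sides of equality (1) map to the same string: a dot, the image of w, and another dot, although the two terms are different trees.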



While equalities (1) and (2) are quite obvious and easy to avoid, (3) - (5) are 
more subtle, and there may be more such equalities. 

The degree of redundancy incurred by the non-canonical grammar is demon- 
strated in Section 4.4. Such redundancy is not an efficiency problem, as the 
asymptotic efficiency of a DP algorithm is not affected. However, it makes it 
impossible to use the same recurrences for other purposes, say for the enumer- 
ation of all suboptimal solutions. This explains why Zuker’s algorithm employs 
an incomplete heuristics when enumerating suboptimal foldings. 



4 Master Recurrences for RNA Folding 

4.1 A Canonical Grammar for Feasible RNA Structures 

In [8] a data type $\mathcal{FS}$ is given together with a grammar $\mathcal{G}_f$ of all feasible
structures. $\mathcal{FS}$ extends T as used above by operators ss and ml representing single
stranded and multiloop structures, plus cons and ul for constructing component
lists. It is easy to show by induction that $\mathcal{L}(\mathcal{G}_f) \subset \mathcal{FS}$, and there is a
canonical mapping $k: \mathcal{L}(\mathcal{G}_f) \to \mathcal{L}(V)$. Another grammar for feasible structures
is implicitly given by the recurrences developed in [22]. These recurrences are
designed for canonicity, since the authors seek a complete and non-redundant
enumeration of suboptimal structures. We do not show either grammar here as
we plan to go one step further, which will provide a significant reduction in the
number of structures to be considered.

4.2 A Canonical Grammar for Canonical RNA Secondary 
Structures 

Although the energy model permits structures of minimal free energy with
isolated (unstacked) base pairs, there are good biophysical arguments to consider
such structures unrealistic. As already noted by Zuker and Sankoff in [24]²,
removing such redundant structures from the search space is the key to obtaining
more significant near-optimal solutions.

² Zuker and Sankoff suggest an even stronger restriction to structures with maximal
helices. Such recurrences are within the scope of the ADP approach [5], but well
beyond the space limits of this article.
Definition 7. An RNA structure without isolated base pairs is canonical.

The canonical model suiting this definition is defined as $\mathcal{L}(W) \times A^*$ using the
string grammar $W = \{R \to \varepsilon \mid . \mid .. \mid S,\ S \to ... \mid .S \mid S. \mid SS \mid ((P)),\ P \to S \mid (P)\}$.
$(d, s) \in \mathcal{K}$ is subject to the restriction that bases in s can pair as indicated
by matching parentheses in d. The following grammar $\mathcal{G}_c$ for canonical RNA
structures is based on the data type $\mathcal{FS}$. It uses an algebra with several base
sets, and an overloaded choice function h.
    canonicals alg x = axiom struct where
      (str,ss,hl,sr,bl,br,il,ml,nil,cons,ul,h) = alg
      singlestrand = ss <<< region
      struct = str <<< p comps                |||
               str <<< (ul <<< singlestrand)  |||
               str <<< (nil ><< empty)        ... h
      comps  = tabulated (cons <<< p block ~~~ p comps               |||
                          ul   <<< p block                           |||
                          cons <<< p block ~~~ (ul <<< singlestrand) ... h)
      block  = tabulated (p strong ||| bl <<< region ~~~ p strong ... h)
      strong = tabulated
               (((sr <<< base ~~~ (p strong ||| p weak) ~~~ base)
                 `with` basepair) ... h)
      weak   = tabulated
               (((hairpin ||| leftB ||| rightB ||| iloop ||| multiloop)
                 `with` basepair) ... h)
        where
        hairpin   = hl <<< base ~~~ (region `with` minsize 3) ~~~ base
        leftB     = sr <<< base ~~~ (bl <<< region ~~~ p strong) ~~~ base
        rightB    = sr <<< base ~~~ (br <<< p strong ~~~ region) ~~~ base
        multiloop = ml <<< base ~~~ (cons <<< p block ~~~ p comps) ~~~ base
        iloop     = sr <<< base ~~~ (il <<< region ~~~ p strong ~~~ region) ~~~ base



The grammar distinguishes substructures closed by a single base pair (weak) 
from those closed by at least two stacked pairs (strong). If we identify these 
two nonterminals and merge their productions, an ADP version of the Wuchty et 
al. DP recurrences [22] is obtained. Note how the grammar takes care that sin- 
gle strands and closed components alternate in multiloops, and that multiloops 
contain at least two branches. 

We now specify the canonical mapping $k : \mathcal{L}(\mathcal{G}_c) \to \mathcal{L}(W)$:

    k (str cs)      = k'' cs
    k (hl b1 r b2)  = "(" ++ k' r ++ ")"
    k (sr b1 s b2)  = "(" ++ k s ++ ")"
    k (ml b1 cs b2) = "(" ++ k'' cs ++ ")"
    k (bl r s)      = k' r ++ k s
    k (br s r)      = k s ++ k' r
    k (il r1 s r2)  = k' r1 ++ k s ++ k' r2
    k (ss r)        = k' r
    k' r            = ['.' | b <- r]       -- a sequence of |r| dots
    k'' cs          = concat (map k cs)    -- concatenating (k c) for all c in cs
Theorem 2. Grammar $\mathcal{G}_c$ is a canonical yield grammar for canonical RNA
secondary structures.

Proof. We have to show that (a) the grammar $\mathcal{G}_c$ is unambiguous, and (b)
the mapping k is bijective. (a) is shown by induction on the derivations of the
grammar. For (b), injectivity of k is shown by structural recursion, while
surjectivity uses a string grammar $k(\mathcal{G}_c)$ (in analogy to $y(\mathcal{G})$ in Sect. 3) to show that
$\mathcal{L}(W) \subseteq \mathcal{L}(k(\mathcal{G}_c))$. Details are omitted.

4.3 Efficiency 

A canonical grammar, whether encoded in ADP or in conventional matrix recur- 
rences, may require some extra tables compared to its non-canonical counterpart, 
in order to keep more structures distinct. In our RNA example, the non-canonical 
grammar zuker81 uses 2 tables, while the canonical grammar canonicals uses
4. This is the price for the added versatility. 

4.4 Applications 

Due to Theorem 2 we know that the DP algorithm $\mathcal{G}_c$ considers each canonical
RNA structure exactly once. Hence it can serve as a “master copy” of all DP 
algorithms which can be formulated as a iF5-algebra. 

Simple Evaluation Algebras. The analyses in Table 1 can be defined each in a
few lines: Energy minimization for canonical structures has been designed and
implemented in [5] and [16]. Structure enumeration has been used to generate
visualizations of the folding space via RNA-Movies [6]. Structure counts are
correct due to canonicity of the grammar, and canonization of structures yields
a dramatic reduction of the folding space. A probabilistic estimate for feasible
structures obtained in this manner³ was already shown in [8] to be remarkably
accurate. Accuracy is confirmed for the estimate of canonical structures supplied
here. The Waterman formula [20] for the number of possible structures for all
sequences of length n can also be written as a simple yield grammar.

    Purpose                     Value Domain          Interpretation of Operators         Choice Function
    --------------------------  --------------------  ----------------------------------  ---------------
    Energy minimization         energy values         energy rules for hairpin loops,     minimization
                                                      bulges, stacked pairs, etc.
    Structure enumeration       trees in data type T  tree constructors HL, IL, SR, etc.  identity function
    Structure counting          integers              multiply counts of substructures    summation
    Structure count estimation  reals                 multiply counts by pairing prob.    summation

Table 1. Different analyses based on $\mathcal{G}_c$

³ Equivalently based on the canonical grammar for feasible structures or on the special
recurrences given in [24], but modified to reflect base composition.



Some Observations. A few of the observations made by applying these evaluation
algebras are summarized in Table 2, showing structure statistics for initial
segments of an RNA sequence⁴ from Neurospora crassa. n denotes sequence length,
and the columns list structures counted, estimated, or evaluated by the various
algorithms. These figures also indicate that the majority of structures accounted
for by Waterman's formula do not exist in the folding space of a given sequence,
the majority of structures considered by the Zuker algorithm is redundant, and
the majority of the feasible structures enumerated by the non-redundant Wuchty
algorithm is non-canonical.

     n   Waterman      Zuker       Probabilistic  Feasible  Canonical  Probabilistic
         formula       algorithm   estimate       structs.  structs.   estimate
                                   feasibles                           canonicals
    --  -----------  ---------  -------------  --------  ---------  -------------
     5            8          0           1.16         1          1           1.00
    10          423         12           5.98         9          1           1.34
    15        30372        544         100.82       106          7           5.82
    20      2516347      38160         510.60       390          7           9.02
    25    226460893    2428352       15160.50     16343         72          71.37
    30  21511212261  229202163      175550.00    235025        244         233.80

Table 2. Some structure statistics collected via the algebras listed in Table 1



Combined Analyses. Since all algebras share the same grammar, a general
construction is available [9] that forms the cross product of two algebras.
Everything is mechanical except the combined choice function. This means we can,
for example, return an optimal solution together with the total number of
co-optimal solutions.
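
As an illustration of such a combined choice function, the following Python sketch pairs a minimizing algebra with a counting algebra. It is not the construction of [9]; the function name and the (score, count) pair encoding are assumptions made for this example only. Note that the resulting count is meaningful only when the underlying grammar is unambiguous, which is exactly what canonicity provides.

```python
def choose_min_with_count(candidates):
    """Hypothetical combined choice function for the product of a
    minimizing algebra and a counting algebra.

    Each candidate is a (score, count) pair. The result keeps the best
    (smallest) score and adds up the counts of all candidates reaching it.
    """
    if not candidates:
        return []
    best = min(score for score, _ in candidates)
    total = sum(count for score, count in candidates if score == best)
    return [(best, total)]


if __name__ == "__main__":
    # Four alternative derivations of one subproblem, as (score, count) pairs.
    alternatives = [(3, 1), (2, 2), (2, 5), (4, 1)]
    print(choose_min_with_count(alternatives))
```

Only this combined choice needs thought; the component functions of the two algebras can simply be applied side by side to the two halves of each pair.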



Structural Motifs. Retaining the evaluation algebras and the canonical model, 
but specializing the grammar, we obtain the above analyses for all classes of 
structural motifs that can be described by a regular tree grammar. 



Application to Pairwise Sequence Comparison. In many situations, confidence in 
the answer computed by a sequence alignment algorithm could be substantiated 
by reporting the number of (different) co-optimal answers, accompanied by some 
measure of the diversity within the co-optimal answer space. This, again, requires 
canonization, which can be achieved by our approach. For lack of space, we refer 
the reader to the “alignment ambiguity awareness suite” in [9], Chapter 3. 



⁴ gaccauacccacuggaaaacucgggaucccguccgcucuccca...

5 Conclusion

We have provided a framework for reasoning about properties of DP algorithms 
related to ambiguity. We hope to have shown that canonical yield grammars are 
a useful concept both in theory - understanding the properties of DP algorithms 
- and in practice - building reliable and versatile DP algorithms more quickly. 
We expect to apply this approach to further problem domains, as we are working 
to explore the full scope of algebraic dynamic programming. 
Abstract. In this paper we consider the incremental/decremental version
of the edit distance problem: given a solution to the edit distance
between two strings A and B, find a solution to the edit distance between
A and B' where B' = aB (incremental) or bB' = B (decremental). As
a solution for the edit distance between A and B, we define the difference
representation of the D-table, which leads to a simple and intuitive
algorithm for the incremental/decremental edit distance problem.



1 Introduction 

Given two strings $A[1..m]$ and $B[1..n]$ over an alphabet $\Sigma$, the edit distance
between A and B is the minimum number of edit operations needed to convert
A to B. The edit distance problem is to find the edit distance between A and
B. Most common edit operations are the following.

1. change: replace one character of A by another single character of B.

2. deletion: delete one character from A.

3. insertion: insert one character into B.

A well-known method for solving the edit distance problem in $O(mn)$ time
uses the D-table. Let $D(i,j)$, $0 \le i \le m$ and $0 \le j \le n$, be the edit
distance between $A[1..i]$ and $B[1..j]$. Initially, $D(i,0) = i$ for $0 \le i \le m$ and
$D(0,j) = j$ for $0 \le j \le n$. An entry $D(i,j)$, $1 \le i \le m$ and $1 \le j \le n$, of
the D-table is determined by the three entries $D(i-1,j-1)$, $D(i-1,j)$, and
$D(i,j-1)$. The recurrence for the D-table is as follows: For all $1 \le i \le m$ and
$1 \le j \le n$,

$$D(i,j) = \min\{D(i-1,j-1) + \delta_{ij},\; D(i-1,j) + 1,\; D(i,j-1) + 1\} \qquad (1)$$

where $\delta_{ij} = 0$ if $A[i] = B[j]$; $\delta_{ij} = 1$, otherwise.
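
Recurrence (1) translates directly into a table-filling routine. The Python sketch below is a plain quadratic-time illustration; the function names `d_table` and `edit_distance` are chosen here for the example and do not come from the paper. Python strings are 0-based, so `a[i - 1]` plays the role of $A[i]$.

```python
def d_table(a, b):
    """Return the (m+1) x (n+1) D-table of recurrence (1) for strings a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # D(i, 0) = i
    for j in range(n + 1):
        d[0][j] = j                      # D(0, j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + delta,   # change (or match)
                          d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1)           # insertion
    return d


def edit_distance(a, b):
    """Edit distance between a and b, read off the last D-table entry."""
    return d_table(a, b)[len(a)][len(b)]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```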

In this paper we consider the following incremental (resp. decremental)
version of the edit distance problem: given a solution for the edit distance between
A and B, compute a solution for the edit distance between A and $aB$ (resp. $B'$
where $B = bB'$), where a (resp. b) is a symbol in $\Sigma$. By a solution we mean some
encoding of the D-table computed between A and B. Since essentially the same
techniques can be used to solve both incremental and decremental versions of
the edit distance problem, we will consider only the decremental version.

* This work was supported by the Brain Korea 21 Project.

The incremental/decremental version of the edit distance problem was first
considered by Landau et al. [2]. They used the C-table |2l4lbl7liij (represented
with linked lists) as a solution for the edit distance between A and B. Given
a threshold k on the edit distance, their algorithm runs in $O(k)$ time. (If the
threshold k is not given, it runs in $O(m + n)$ time.) However, the result in [2] is
quite complicated.

As a solution for the edit distance between A and B, we define the difference
representation of the D-table (DR-table for short). Each entry $DR(i,j)$ in the
DR-table between A and B has two fields defined as follows: For $1 \le i \le m$ and
$1 \le j \le n$,

1. $DR(i,j).U = D(i,j) - D(i-1,j)$

2. $DR(i,j).L = D(i,j) - D(i,j-1)$

A third field $DR(i,j).UL$, which is defined to be $D(i,j) - D(i-1,j-1)$, will be
used later, but it need not be stored in $DR(i,j)$ because it can be computed as
$DR(i,j).U + DR(i-1,j).L$. Because the possible values that each of $DR(i,j).U$
and $DR(i,j).L$ can have are $-1$, $0$, and $1$, we need only four bits to store an
entry in the DR-table. It is easy to see that the D-table can be converted to the
DR-table in $O(mn)$ time, and vice versa. We can also compute one row (resp.
column) of the D-table from the DR-table in $O(n)$ (resp. $O(m)$) time.
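
The conversion between the two representations can be sketched as follows. This is an illustrative Python fragment under our own naming (`dr_table`, `row_from_dr`); the paper does not prescribe a concrete layout, and the tuple `(U, L)` is used here instead of a packed four-bit encoding.

```python
def d_table(a, b):
    """D-table of recurrence (1); repeated here so the sketch is self-contained."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + delta, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def dr_table(d):
    """dr[i][j] = (U, L) for i, j >= 1; row 0 and column 0 are left as None."""
    m, n = len(d) - 1, len(d[0]) - 1
    dr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dr[i][j] = (d[i][j] - d[i - 1][j], d[i][j] - d[i][j - 1])
    return dr


def row_from_dr(dr, i):
    """Recover row i of the D-table in O(n) time from D(i, 0) = i and the L fields."""
    n = len(dr[0]) - 1
    if i == 0:
        return list(range(n + 1))
    row = [i]
    for j in range(1, n + 1):
        row.append(row[-1] + dr[i][j][1])
    return row


if __name__ == "__main__":
    d = d_table("abcab", "bcaab")
    dr = dr_table(d)
    print(all(row_from_dr(dr, i) == d[i] for i in range(len(d))))
```

A column is recovered symmetrically from $D(0,j) = j$ and the U fields.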

In this paper we present an $O(m + n)$-time algorithm for the
incremental/decremental edit distance problem. Our result is much simpler and more
intuitive than that of Landau et al. [2]. A key tool in our algorithm is the change
table between the two D-tables before and after an increment/decrement. The
change table is not actually constructed in our algorithm, but it is central in
understanding our algorithm.

Our result finds a variety of applications. To verify whether a string p is an
approximate period of another string x where $|x| = n$ and $|p| = m$, one needs to
find the edit distance between p and every substring of x |S!. A naive method that
computes a D-table of size $O(m^2)$ for each position of x will take $O(m^2 n)$ time,
but our algorithm reduces the time complexity to $O(mn)$ jO]. Other applications
include the longest prefix match problem, the approximate overlap problem, the
cyclic string comparison problem, and the text screen update problem [3].

This paper is organized as follows. In section 2, we describe the important 
properties of the change table. In section 3, we present our algorithm for the 
incremental/decremental edit distance problem. 



2 Preliminary Properties 

Let $\Sigma$ be a finite alphabet of symbols. A string over $\Sigma$ is a finite sequence of
symbols in $\Sigma$. The length of a string A is denoted by $|A|$. The i-th symbol in
A is denoted by $A[i]$ and the substring consisting of the i-th through the j-th
symbols of A is denoted by $A[i..j]$.

Sung-Ryul Kim and Kunsoo Park

Let A and B be strings of lengths m and n, respectively, over $\Sigma$, and let
$B' = B[2..n]$. Let D be the D-table between A and B and let D' be the D-table
between A and B'. Also let DR be the DR-table between A and B and let DR'
be the DR-table between A and B'. In this section, we prove the key properties
between D and D' that enable us to compute DR' from DR efficiently.
Fig. 1. An example Ch-table



One key tool in understanding our algorithm is the change table (Ch-table
for short) from D to D'. Later, when we compute DR' from DR, the first column
of DR is discarded and each entry $DR(i,j+1)$, $0 \le i \le m$ and $0 \le j < n$, will
be converted to $DR'(i,j)$. Thus, each entry in the Ch-table Ch from D to D' is
defined as follows:

$$Ch(i,j) = D'(i,j) - D(i,j+1).$$

The Ch-table is not actually constructed in our algorithm because the
initialization of the Ch-table will require $O(mn)$ time. It will be used only for the
description of the algorithm. See Fig. 1 for an example Ch-table.

Figure 1 suggests a property of the Ch-table: the entries of value $-1$ (resp.
1) appear contiguously in the upper-right (resp. lower-left) part of the Ch-table
in a staircase-shaped region. This property is formally proved in the following
series of lemmas.

Lemma 1. In the Ch-table Ch, the following properties hold.

1. $Ch(0,j) = -1$ for all $0 \le j < n$.

2. $Ch(i,0) = 0$ for all $1 \le i < k$, where k is the smallest index in A such that
$A[k] = B[1]$.

3. $Ch(i,0) = 1$ for all $k \le i \le m$.

Proof. Immediate from the definition of the D-table.

Lemma 2. For $1 \le i \le m$ and $1 \le j < n$, the possible values of $Ch(i,j)$ are
in the range $\min\{Ch(i-1,j-1), Ch(i-1,j), Ch(i,j-1)\} \,..\, \max\{Ch(i-1,j-1), Ch(i-1,j), Ch(i,j-1)\}$.
Proof. Recall that $Ch(i,j)$ is defined to be $D'(i,j) - D(i,j+1)$. By recurrence
(1), $D(i,j+1)$ is

$$\min\{D(i-1,j) + \delta_{i,j+1},\; D(i-1,j+1) + 1,\; D(i,j) + 1\}. \qquad (2)$$

Also, $D'(i,j)$ is $\min\{D'(i-1,j-1) + \delta'_{ij},\; D'(i-1,j) + 1,\; D'(i,j-1) + 1\}$ where
$\delta'_{ij} = 0$ if $A[i] = B'[j]$; $\delta'_{ij} = 1$, otherwise. Because $B'[j]$ is the same symbol as
$B[j+1]$, $\delta'_{ij} = \delta_{i,j+1}$. Hence,

$$D'(i,j) = \min \begin{cases} D(i-1,j) + Ch(i-1,j-1) + \delta_{i,j+1} \\ D(i-1,j+1) + Ch(i-1,j) + 1 \\ D(i,j) + Ch(i,j-1) + 1. \end{cases} \qquad (3)$$

Note that the only differences between (2) and (3) are additional terms
$Ch(i-1,j-1)$, $Ch(i-1,j)$, and $Ch(i,j-1)$ in (3). Assume without loss of generality
that the second argument is minimum in (2). If the second argument is minimum
in (3), the lemma holds because $Ch(i,j) = Ch(i-1,j)$. Otherwise, assume
without loss of generality that the third argument is minimum in (3). Then
$Ch(i,j) = D(i,j) + Ch(i,j-1) + 1 - (D(i-1,j+1) + 1) \ge Ch(i,j-1)$ because
the second argument is minimum in (2). Also, $Ch(i,j) \le Ch(i-1,j)$ because
the third argument is minimum in (3).

Corollary 1. The possible values of $Ch(i,j)$ are $-1$, $0$, and $1$.

Proof. It follows from Lemmas 1 and 2.

Lemma 3. For each $0 \le i \le m$, let $f(i)$ be the smallest integer j such that
$Ch(i,j) = -1$. ($f(i) = n$ if $Ch(i,j') \ne -1$ for $0 \le j' < n$.) Then, $Ch(i,j') = -1$
for all $f(i) \le j' < n$. Furthermore, $f(i) \ge f(i-1)$ for $1 \le i \le m$.

Proof. We use induction on i. When $i = 0$, $f(i) = 0$ and the lemma holds
by Lemma 1. Assume inductively that the lemma holds for $i = k$. That is,
$Ch(k,j') \ne -1$ for $0 \le j' < f(k)$ and $Ch(k,j') = -1$ for $f(k) \le j' < n$.

Let $Ch(k+1,l)$ be the first entry in row $k+1$ that is $-1$. For $Ch(k+1,l)$ to be
$-1$, at least one of $Ch(k,l-1)$ and $Ch(k,l)$ must be $-1$ by Lemma 2. Thus, we
have shown that $l = f(k+1) \ge f(k)$. It is easy to see that $Ch(k+1,l') = -1$ for
$f(k+1) \le l' < n$ by the inductive assumption, the condition that $f(k+1) \ge f(k)$,
and Lemma 2.

The following lemma is symmetric to Lemma 3 and it can be similarly proved.

Lemma 4. For each $0 \le j < n$, let $g(j)$ be the smallest integer i such that
$Ch(i,j) = 1$. ($g(j) = m + 1$ if $Ch(i',j) \ne 1$ for $0 \le i' \le m$.) Then, $Ch(i',j) = 1$
for all $g(j) \le i' \le m$. Furthermore, $g(j) \ge g(j-1)$ for $1 \le j < n$.
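
The staircase structure described by Corollary 1 and Lemmas 3 and 4 can be observed on small inputs by building the Ch-table explicitly. The following Python sketch does this by brute force in quadratic time, purely as a checking aid; it is not the linear-time algorithm of Section 3, and the names `ch_table` and `staircase_ok` are ours.

```python
def d_table(a, b):
    """D-table of recurrence (1); repeated here so the sketch is self-contained."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + delta, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def ch_table(a, b):
    """Ch(i, j) = D'(i, j) - D(i, j+1), where D' is the D-table of a and b[1:].

    b must be non-empty; the result has len(a)+1 rows and len(b) columns.
    """
    d, d_new = d_table(a, b), d_table(a, b[1:])
    return [[d_new[i][j] - d[i][j + 1] for j in range(len(b))]
            for i in range(len(a) + 1)]


def staircase_ok(ch):
    """Check Corollary 1 and the monotone boundaries of Lemmas 3 and 4."""
    rows, cols = len(ch), len(ch[0])
    if any(v not in (-1, 0, 1) for row in ch for v in row):
        return False
    f_prev = 0
    for row in ch:                                    # Lemma 3, row by row
        f = next((j for j, v in enumerate(row) if v == -1), cols)
        if any(v != -1 for v in row[f:]) or f < f_prev:
            return False
        f_prev = f
    g_prev = 0
    for j in range(cols):                             # Lemma 4, column by column
        col = [ch[i][j] for i in range(rows)]
        g = next((i for i, v in enumerate(col) if v == 1), rows)
        if any(v != 1 for v in col[g:]) or g < g_prev:
            return False
        g_prev = g
    return True


if __name__ == "__main__":
    for row in ch_table("ab", "ba"):
        print(row)
```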

We say that an entry $Ch(i,j)$ is affected if the values of $Ch(i-1,j-1)$,
$Ch(i-1,j)$, and $Ch(i,j-1)$ are not the same. We also say that $DR'(i,j)$ is affected if
$Ch(i,j)$ is affected.

Lemma 5. If $DR'(i,j)$ is not affected, then $DR'(i,j)$ equals $DR(i,j+1)$.

Proof. If $DR'(i,j)$ is not affected, then the value of $Ch(i,j)$ is the same as the
common value of $Ch(i-1,j-1)$, $Ch(i-1,j)$, and $Ch(i,j-1)$ by Lemma 2.
Then $DR'(i,j).U = D'(i,j) - D'(i-1,j) = D(i,j+1) + Ch(i,j) - (D(i-1,j+1) + Ch(i-1,j)) = DR(i,j+1).U$.
Similarly, $DR'(i,j).L = DR(i,j+1).L$.

We say that an entry $Ch(i,j)$ is a $(-1)$-boundary (resp. 1-boundary) entry if
$Ch(i,j)$ is of value $-1$ (resp. 1) and at least one of $Ch(i,j-1)$, $Ch(i+1,j)$, and
$Ch(i+1,j-1)$ (resp. $Ch(i,j+1)$, $Ch(i-1,j)$, and $Ch(i-1,j+1)$) is not of
value $-1$ (resp. 1).

By Lemma 5 we can conclude that in computing DR' from DR, only the
affected entries need be changed. See Fig. 1 again. Because the entries whose
values are $-1$ (or 1) appear contiguously in the Ch-table, the affected entries
are either $(-1)$- or 1-boundary entries themselves or appear adjacent to $(-1)$-
or 1-boundary entries. The key idea of our algorithm is to scan the $(-1)$- and
1-boundary entries starting from the upper-left corner of the DR-table when we
compute the affected entries. Lemmas 3 and 4 imply that the number of $(-1)$-
and 1-boundary entries in the DR-table is $O(m + n)$.

3 Boundary Scan Algorithm 

In this section we show how to compute DR' from DR. First, we describe how we
scan the boundary entries starting from the upper-left corner of the DR'-table
within the proposed time complexity. Then, we will mention the modifications
to the boundary-scan algorithm which lead to an algorithm that converts DR
to DR'.

For simplicity we will use the Ch-table in the description of our algorithm.
However, the Ch-table is not explicitly constructed but accessed through the
one-dimensional tables $f()$ and $g()$. The details will be given later.

Lemma 6.

$$Ch(i,j) = \min \begin{cases} -DR(i,j+1).UL + Ch(i-1,j-1) + \delta_{i,j+1} \\ -DR(i,j+1).U + Ch(i-1,j) + 1 \\ -DR(i,j+1).L + Ch(i,j-1) + 1 \end{cases}$$

(i.e., $Ch(i-1,j-1)$, $Ch(i-1,j)$, $Ch(i,j-1)$, and $DR(i,j+1)$ are needed to
compute $Ch(i,j)$).

Proof. Recall that $Ch(i,j) = D'(i,j) - D(i,j+1)$. Substituting recurrence (1)
for $D'(i,j)$ and distributing $D(i,j+1)$ into the min function, we have $Ch(i,j) =
\min\{\ldots, D'(i-1,j) - D(i,j+1) + 1, \ldots\}$ (only the second argument is shown).
Substituting $D(i-1,j+1) + Ch(i-1,j)$ for $D'(i-1,j)$, the second argument
becomes $D(i-1,j+1) - D(i,j+1) + Ch(i-1,j) + 1 = -DR(i,j+1).U + Ch(i-1,j) + 1$.
The lemma follows from similar calculations for the first and
the third arguments.
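
The identity of Lemma 6 can be checked numerically. The Python sketch below fills the whole Ch-table with the lemma's formula, using Lemma 1 for row 0 and column 0, and can be compared against the definition $Ch(i,j) = D'(i,j) - D(i,j+1)$. It visits every entry and therefore runs in quadratic time; it is a checking aid under our own naming (`ch_by_lemma6`, `ch_by_definition`), not the boundary-scan algorithm. For brevity the DR fields are read off the D-table on the fly.

```python
def d_table(a, b):
    """D-table of recurrence (1); repeated here so the sketch is self-contained."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + delta, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def ch_by_definition(a, b):
    """Ch(i, j) = D'(i, j) - D(i, j+1) with D' the D-table of a and b[1:]."""
    d, d_new = d_table(a, b), d_table(a, b[1:])
    return [[d_new[i][j] - d[i][j + 1] for j in range(len(b))]
            for i in range(len(a) + 1)]


def ch_by_lemma6(a, b):
    """Fill the Ch-table from DR(i, j+1) and three neighbouring Ch entries."""
    d = d_table(a, b)
    m, n = len(a), len(b)
    ch = [[None] * n for _ in range(m + 1)]
    for j in range(n):
        ch[0][j] = -1                        # Lemma 1.1
    seen_first = False
    for i in range(1, m + 1):
        seen_first = seen_first or a[i - 1] == b[0]
        ch[i][0] = 1 if seen_first else 0    # Lemma 1.2 and 1.3
    for i in range(1, m + 1):
        for j in range(1, n):
            u = d[i][j + 1] - d[i - 1][j + 1]    # DR(i, j+1).U
            l = d[i][j + 1] - d[i][j]            # DR(i, j+1).L
            ul = d[i][j + 1] - d[i - 1][j]       # DR(i, j+1).UL
            delta = 0 if a[i - 1] == b[j] else 1  # delta_{i, j+1}
            ch[i][j] = min(-ul + ch[i - 1][j - 1] + delta,
                           -u + ch[i - 1][j] + 1,
                           -l + ch[i][j - 1] + 1)
    return ch


if __name__ == "__main__":
    print(ch_by_lemma6("abcab", "bcaab") == ch_by_definition("abcab", "bcaab"))
```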
Algorithm 1

    Let k be the smallest index in A such that A[k] = B[1]
    (i_{-1}, j_{-1}) ← (0, 1); (i_1, j_1) ← (k, 0); f(0) ← 0; g(0) ← k
    finished_{-1} ← false
    finished_1 ← false
    while not finished_{-1} or not finished_1 do
        if i_{-1} < i_1 - 1 then {Case 1}
            Compute Ch(i_{-1} + 1, j_{-1}). {See Fig. 4}
            if Ch(i_{-1} + 1, j_{-1}) = -1 then
                i_{-1} ← i_{-1} + 1; f(i_{-1}) ← j_{-1}
            else
                j_{-1} ← j_{-1} + 1
            fi
        else if j_1 < j_{-1} - 1 then {Case 2}
            Symmetric to Case 1.
        else {Case 3, i_1 = i_{-1} + 1 and j_1 = j_{-1} - 1}
            Compute Ch(i_{-1} + 1, j_{-1}). {See Fig. 5}
            if Ch(i_{-1} + 1, j_{-1}) = -1 then
                i_{-1} ← i_{-1} + 1; i_1 ← i_1 + 1; f(i_{-1}) ← j_{-1}
            else if Ch(i_{-1} + 1, j_{-1}) = 1 then
                j_{-1} ← j_{-1} + 1; j_1 ← j_1 + 1; g(j_1) ← i_1
            else
                j_{-1} ← j_{-1} + 1; i_1 ← i_1 + 1
            fi
        fi
        if i_{-1} = m or j_{-1} = n then finished_{-1} ← true fi
        if i_1 = m + 1 or j_1 = n - 1 then finished_1 ← true fi

Fig. 2. Algorithm 1



Algorithm 1 is the boundary-scan algorithm. In the algorithm, the pair
$(i_{-1}, j_{-1})$ (resp. $(i_1, j_1)$) indicates that $Ch(i_{-1}, j_{-1})$ (resp. $Ch(i_1, j_1)$) is the
current $(-1)$-boundary (resp. 1-boundary) entry that is being scanned. The
following property holds for $Ch(i_{-1}, j_{-1})$ and $Ch(i_1, j_1)$ by Lemmas 3 and 4. See Fig. 3
for an illustration.

Property 1.

1. $Ch(i,j) \ne -1$ if $i > i_{-1}$ and $j < j_{-1}$.

2. $Ch(i,j) \ne 1$ if $i < i_1$ and $j > j_1$.

In one iteration of the loop in Algorithm 1, one or both of the current boundary
entries are moved to the next boundary entries. For example, the current
$(-1)$-boundary entry is moved to the next $(-1)$-boundary entry which can be down
or to the right of the current $(-1)$-boundary entry. We maintain the following
invariants in each iteration of Algorithm 1.
Fig. 3. Boundary entry conditions



Invariant 1

1. $i_{-1} < i_1$ and $j_{-1} > j_1$.

2. All values of $f(0), \ldots, f(i_{-1})$ are known.

3. All values of $g(0), \ldots, g(j_1)$ are known.

One iteration of Algorithm 1 has three cases. Case 1 applies when the current
$(-1)$-boundary can be moved by one entry (down or to the right) without
violating Invariant 1.1. Case 2 applies when the current 1-boundary can be moved
by one entry (down or to the right) without violating Invariant 1.1. Case 3
applies when moving the $(-1)$-boundary entry down by one entry or moving the
1-boundary entry to the right by one entry will violate Invariant 1.1, and thus
both boundary entries have to be moved simultaneously. What Algorithm 1 does
in each case is described in Fig. 2.




Fig. 4. Case 1 
What remains to show is the methods to obtain the values of the Ch-table
entries that are used to compute a new Ch-table entry, e.g., $Ch(i_{-1} + 1, j_{-1})$ in
Case 1. The two subcases for Case 1 are depicted in Fig. 4. The first subcase is
when $j_{-1} > j_1 + 1$. See Fig. 4 (a). The unknown values of the Ch-table entries
are X and Y. By Invariant 1.2 the value of $f(i_{-1})$ is known. If $f(i_{-1}) < j_{-1}$, then
$X = -1$. Otherwise ($f(i_{-1}) = j_{-1}$), $X = 0$ because X is not 1 by Property 1.2.
It is easy to see that $Y = 0$ because Y is inside the region in which there are
no $(-1)$'s (by Property 1.1) and no 1's (by Property 1.2). The second subcase
is when $j_{-1} = j_1 + 1$. See Fig. 4 (b). We can compute the value of X as $-1$
if $f(i_{-1}) < j_{-1}$; 1 if $g(j_1) \le i_{-1}$; 0, otherwise. We know that $Y \ne -1$ by
Property 1.1. Thus, $Y = 1$ if $g(j_1) \le i_{-1} + 1$; $Y = 0$, otherwise. Case 3 is
depicted in Fig. 5. The value of X can be computed as we computed the value
of X in the second subcase of Case 1.




Fig. 5. Case 3 



We now show that all affected Ch-table entries are computed by Algorithm 1.
It is easy to see that each affected entry $Ch(i,j)$, $1 \le i \le m$ and $1 \le j < n$, falls
into one of the following types by Lemmas 3 and 4. For each of the types we can
easily check which cases in our algorithm compute $Ch(i,j)$.

1. $Ch(i,j)$ is a $(-1)$-boundary entry such that $Ch(i,j-1) \ne -1$: $Ch(i,j)$ is
computed by Case 1 if $Ch(i,j-1) = 0$; by Case 3, otherwise.

2. $Ch(i,j)$ is a 1-boundary entry such that $Ch(i-1,j) \ne 1$: $Ch(i,j)$ is
computed by Case 2 if $Ch(i-1,j) = 0$; by Case 3, otherwise.

3. $Ch(i,j) = 0$ and either $Ch(i-1,j) = -1$ or $Ch(i,j-1) = 1$: $Ch(i,j)$ is
computed by Case 1 if $Ch(i,j-1) = 0$; by Case 2 if $Ch(i-1,j) = 0$; by
Case 3, otherwise.

To compute DR' from DR, we first discard the first column from DR. Then,
we run a modified version of Algorithm 1. The modification to Algorithm 1 is to
compute $DR'(i,j)$ whenever we compute the value of $Ch(i,j)$. Once $Ch(i,j)$ is
computed using Lemma 6, the fields in $DR'(i,j)$ can be easily computed. That
is, $DR'(i,j).L = DR(i,j+1).L + Ch(i,j) - Ch(i,j-1)$ and $DR'(i,j).U =
DR(i,j+1).U + Ch(i,j) - Ch(i-1,j)$.
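
The two update formulas can be exercised with a small reference implementation. The Python sketch below applies them to every entry, using a Ch-table computed from its definition, so it runs in quadratic time and serves only to illustrate the field updates; the linear-time bound of the paper comes from touching affected entries only. The function name `dr_after_decrement` and the dictionary layout are assumptions of this example.

```python
def d_table(a, b):
    """D-table of recurrence (1); repeated here so the sketch is self-contained."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + delta, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def dr_after_decrement(a, b):
    """Return {(i, j): (U, L)} for DR' between a and b[1:], plus the table D'.

    DR' is derived from DR(i, j+1) and the Ch-table with the two update
    formulas; D' is returned only so that callers can cross-check the result.
    """
    d, d_new = d_table(a, b), d_table(a, b[1:])
    m, n = len(a), len(b)
    ch = [[d_new[i][j] - d[i][j + 1] for j in range(n)] for i in range(m + 1)]
    dr_new = {}
    for i in range(1, m + 1):
        for j in range(1, n):
            u_old = d[i][j + 1] - d[i - 1][j + 1]    # DR(i, j+1).U
            l_old = d[i][j + 1] - d[i][j]            # DR(i, j+1).L
            dr_new[i, j] = (u_old + ch[i][j] - ch[i - 1][j],
                            l_old + ch[i][j] - ch[i][j - 1])
    return dr_new, d_new


if __name__ == "__main__":
    dr_new, d_new = dr_after_decrement("abcab", "bcaab")
    print(all(dr_new[i, j] == (d_new[i][j] - d_new[i - 1][j],
                               d_new[i][j] - d_new[i][j - 1])
              for (i, j) in dr_new))
```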
We can easily check that one iteration of the loop takes only constant time
and that it increases at least one of $i_{-1}$, $j_{-1}$, $i_1$, and $j_1$ by one. Hence, the time
complexity of our algorithm is $O(m + n)$.

Theorem 1. Let A and B be two strings of lengths m and n, respectively, and
$B' = B[2..n]$. Given the difference representation DR between A and B, the
difference representation DR' between A and B' can be computed in $O(m + n)$
time.
Abstract. Bounds are given on the size of the parameter-space decomposition
induced by multiple sequence alignment problems where phylogenetic
information may be given or inferred. It is shown that many
of the usual formulations of these problems fall within the same integer
parametric framework, implying that the number of distinct optima obtained
as the parameters are varied across their ranges is polynomially
bounded in the length and number of sequences.



1 Introduction 

Multiple sequence comparison is among the most basic computational problems
in biology, serving as an aid to identifying common structure and function. The
problem is, in a way, simply a generalization of pairwise sequence comparison,
a question that has been studied extensively (see ^3). However, when three or
more sequences are involved, new issues arise, making definitions more complex
and the associated problems harder to solve efficiently. One of these new
facets is the role of evolutionary relationships between sequences. One possibility
is to disregard these relationships, at least to a certain extent, by comparing
all sequences against each other; the sum-of-pairs approach is one example.
Using or attempting to infer evolutionary relationships leads to a host of
new problems. In one problem, called phylogenetic alignment, the input is a tree
whose leaves are labeled by sequences and the objective is to find a labeling
of the internal nodes that minimizes the total length of the tree, which is the
sum of the (evolutionary) distances between adjacent sequences in the tree.
In another problem, generalized phylogenetic alignment, the input is a set S of
sequences and one must find a sequence-labeled tree of minimum length wherein
each element of S labels a distinct leaf in the tree. Sum-of-pairs multiple
alignment, phylogenetic alignment, generalized phylogenetic alignment, and
some of their variants are known to be NP-hard. However, the first two
can be solved in time polynomial in the lengths of the sequences and exponential
in their number. Implementations of several of these methods, often relying on
heuristics, are available.

The optimum solution to a multiple alignment problem depends on the various
parameters used to compute inter-sequence distance or similarity, e.g., the
weights of mismatches and spaces. These are not always easy to choose. One
approach is parametric analysis, i.e., to examine the space of all parameter
choices to determine the set of all optimum solutions. This question has been
studied for pairwise sequence comparison, to a lesser extent
for sum-of-pairs multiple alignment, and hardly at all for phylogenetic
alignment (see, however, [J). Here we explore parametric multiple alignment
and phylogenetic alignment.

One objective of parametric analysis is to establish upper bounds on the 
number of distinct optimality regions, i.e., maximal connected regions of the 
parameter space such that, within each region, a single solution is optimal.
Gusfield et al. obtained the first results, proving, among other things, a
upper bound on the number of regions for two-sequence global alignment, the
shorter of which has length n. This was extended to obtain a upper bound
for the parametric sum-of-pairs alignment of k sequences of length n.
In the present paper, we argue that many multiple alignment schemes fall
within the same “integer parametric” framework. This leads to the somewhat
surprising result that the number of optimality regions for these problems is 
polynomial in both the lengths of the sequences and their number, even if the 
scoring is alphabet-dependent. Our bounds are consequences of the following 
observation: While the number of potential phylogenies and sequences labeling 
them is exponentially large, any scoring system based on linear functions whose 
coefficients are themselves functions of discrete features of alignments (e.g, num- 
ber of mismatches, spaces, etc.), only allows a polynomially-bounded number 
of distinct cost functions to be optimal. The techniques used are uniform and 
straightforward; once the problems are formulated properly, the common struc- 
ture becomes evident. Better bounds might be obtainable by a tighter analysis of 
our framework; however, we suspect that significant advances will require deeper 
understanding of the combinatorial structure of the individual problems. 

Main Results. The alignment problems studied here are classified according to 
whether the scoring is (i) local or global, (ii) distance-based or similarity-based, 
(iii) alphabet-dependent or -independent, (iv) gap-length dependent or not. The 
input consists of k sequences, which, for simplicity, are assumed to have the same 
length n. Our results include the following bounds on the number of optimality 
regions. 

1. A bound for global multiple alignment under sum-of-pairs
alphabet-independent similarity and distance scoring, with zero gap penalty, when
the induced alignment of only p of the $\binom{k}{2}$ possible pairs is considered for
computing the total score.

2. A bound for the previous problem when the gap penalty varies.
For the pairwise case (p = 1), this improves on an earlier O(n^) bound by
Gusfield et al. A bound also holds for the local case when the
gap penalty is zero. For the pairwise case, this improves on an O(n^) bound
by Gusfield et al.

3. A bound for phylogenetic and generalized phylogenetic alignment
under distance-based global alphabet-independent scoring with zero
gap penalty. The bound goes up to when the gap penalty is
variable.

4. A bound for star alignment (a special case of tree alignment)
under global alphabet-independent distance-based scoring with fixed zero
gap penalty. This increases to when the gap penalty is allowed
to vary.

5. Polynomial bounds for sum-of-pairs, tree alignment, and generalized tree 
alignment problems under alphabet-dependent, global or local, similarity or 
distance scoring. 

Organization of the Paper. The main problems studied here, as well as some of
their properties, are defined in Section 2. Section 3 discusses parametric analysis
in a general context and consists of two parts. First, we obtain an upper bound
on the number of optimality regions for parametric problems satisfying certain
integrality conditions. Second, we describe a general approach to generating
parameter-space decompositions. Parametric multiple alignments are discussed
in Section 4. Section 5 presents some conclusions and open problems.

2 Preliminaries 

We now give formal definitions and prove some of the basic properties of the 
problems whose parametric versions we shall study. The first part of this sec- 
tion introduces distance and similarity measures based on pairwise alignments. 
These notions are the basis for the scoring schemes used in multiple sequence 
comparison, which are discussed in the second part. In what follows, $\Sigma$ will
denote an alphabet that includes a special space character '-'. All input strings
are assumed to be over $\Sigma \setminus \{-\}$.

2.1 Pairwise Alignments 

An alignment between two strings $S_1$ and $S_2$ is a pair of equal-length strings
$A = (S'_1, S'_2)$ where $S'_1$ and $S'_2$ are obtained by inserting space characters into
$S_1$ and $S_2$, respectively, so that there is no character position in which both
$S'_1$ and $S'_2$ have spaces. A match is a position in which $S'_1$ and $S'_2$ have the
same character. A mismatch is a position in which $S'_1$ and $S'_2$ have different
characters, neither of which is a '-'. An indel is a position in which one of $S'_1$
and $S'_2$ has a '-'. A gap is a sequence of one or more consecutive spaces in $S'_1$ or
$S'_2$. Collectively, we call the matches, mismatches, indels, and gaps the features
of A. These features are used to compute the value of A according to a certain
scoring scheme. In local scoring schemes, the goal is to locate highly similar
substrings. In global schemes, the entire input strings are taken into account.
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Global Alignment. We first consider alphabet-independent scoring schemes. Let 
XAi VAi and ZA denote, respectively, the number of matches, mismatches, 
indels, and gaps in an alignment A, and let a, f3, and 7 be the mismatch, indel, 
and gap penalties, respectively. Penalties are assumed to be nonnegative. 

The similarity value of an alignment A is given by

$$\sigma_A = w_A - \alpha x_A - \beta y_A - \gamma z_A. \qquad (1)$$

The global similarity between sequences $S_1$ and $S_2$ is defined as

$$\mathrm{sim}(S_1, S_2) = \max\{\sigma_A : A \text{ is an alignment of } S_1 \text{ and } S_2\}. \qquad (2)$$

The distance value of an alignment A is

$$\delta_A = \alpha x_A + \beta y_A + \gamma z_A. \qquad (3)$$

The distance between $S_1$ and $S_2$ is

$$\mathrm{dist}(S_1, S_2) = \min\{\delta_A : A \text{ is an alignment of } S_1 \text{ and } S_2\}. \qquad (4)$$



Schemes (1) and (3) are related by the lemma below, in which the mismatch,
indel, and gap penalties are given by triples $(\alpha, \beta, \gamma)$, and n and m denote
the lengths of the two input strings.

Lemma 1. Under global alphabet-independent scoring,

$$\sigma_A(\alpha, \beta, \gamma) = \frac{n + m}{2} - \delta_A(\alpha + 1, \beta + 1/2, \gamma).$$

Therefore, a pairwise alignment has maximum similarity score at $(\alpha, \beta, \gamma)$ if and
only if it has minimum distance score at $(\alpha + 1, \beta + 1/2, \gamma)$.

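Proof. Every character of the two input strings lies in exactly one match,
mismatch, or indel position, so A satisfies $2w_A + 2x_A + y_A = n + m$. Thus, the
similarity score of A can be re-expressed as

$$\sigma_A(\alpha, \beta, \gamma) = \frac{n + m}{2} - (\alpha + 1) x_A - \left(\beta + \frac{1}{2}\right) y_A - \gamma z_A$$

$$= \frac{n + m}{2} - \delta_A(\alpha + 1, \beta + 1/2, \gamma),$$

where the second line follows from the definition of distance score (3). The rest
of the lemma follows immediately. □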
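
The identity in Lemma 1 can be checked numerically on a small example. The following is a minimal sketch, not part of the original paper: the function names (`features`, `similarity`, `distance`) and the sample alignment are illustrative assumptions, and a gap is counted as a maximal run of spaces within one row.

```python
from fractions import Fraction

def features(a1, a2, space="-"):
    """Count matches w, mismatches x, indels y, and gaps z of a pairwise alignment."""
    assert len(a1) == len(a2)
    w = x = y = z = 0
    prev1 = prev2 = False  # was the previous position a space in row 1 / row 2?
    for c1, c2 in zip(a1, a2):
        s1, s2 = c1 == space, c2 == space
        assert not (s1 and s2), "no column may contain two spaces"
        if s1 or s2:
            y += 1
            # a new gap starts when a space follows a non-space in the same row
            if s1 and not prev1:
                z += 1
            if s2 and not prev2:
                z += 1
        elif c1 == c2:
            w += 1
        else:
            x += 1
        prev1, prev2 = s1, s2
    return w, x, y, z

def similarity(f, alpha, beta, gamma):
    w, x, y, z = f
    return w - alpha * x - beta * y - gamma * z  # equation (1)

def distance(f, alpha, beta, gamma):
    w, x, y, z = f
    return alpha * x + beta * y + gamma * z  # equation (3)

# Hypothetical example alignment of "ACGTT" (n = 5) and "ACCGA" (m = 5).
a1, a2 = "AC-GTT", "ACCG-A"
f = features(a1, a2)
n, m = len(a1.replace("-", "")), len(a2.replace("-", ""))
alpha, beta, gamma = Fraction(2), Fraction(3, 2), Fraction(1)
lhs = similarity(f, alpha, beta, gamma)
rhs = Fraction(n + m, 2) - distance(f, alpha + 1, beta + Fraction(1, 2), gamma)
```

Exact rational arithmetic (`Fraction`) is used so that the two sides can be compared with `==` rather than within a floating-point tolerance.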



Note that we can assume without loss of generality that the mismatch penalty
$\alpha$ in (3) is one, since changing its value only affects the magnitude, but not the
relative values, of the alignments. Thus, there are effectively only two parameters
to be chosen for distance scoring. By Lemma 1, this is also true for similarity
scoring.

Alphabet-dependent scoring schemes depend on a symmetric $|\Sigma| \times |\Sigma|$ substitution
matrix a, where a(s, t) is the cost of lining up character s with character
t. Widely-used families of matrices for protein alignment are PAM and
BLOSUM. The similarity score of an alignment A is now given by

$$\sigma_A = -\gamma z + \sum_{\{s,t\} \subseteq \Sigma} a(s, t) \cdot x(s, t), \qquad (5)$$

where x(s, t) is the number of times character s is lined up with character t in
A and z is the number of gaps in A. The similarity between two sequences is
obtained by applying this scoring scheme in (2).

The alphabet-dependent distance score of A, $\delta_A$, can be defined in the same
way as (5), except that the “$-\gamma z$” term is replaced by “$+\gamma z$.” For both similarity
and distance, the total number of parameters is $(|\Sigma|^2 + |\Sigma|)/2 + 1$: the number
of entries in the substitution matrix plus the gap penalty.

Local Alignment. For two strings S and R, we write $S \sqsubseteq R$ if S is a substring of
R. The local similarity between $S_1$ and $S_2$, denoted $\mathrm{sim}_L(S_1, S_2)$, is

$$\mathrm{sim}_L(S_1, S_2) = \max\{\sigma_A : A \text{ is an alignment of } S'_1 \sqsubseteq S_1 \text{ with } S'_2 \sqsubseteq S_2\}. \qquad (6)$$

The scoring scheme used in the definition above may be alphabet-dependent
or -independent. Note that the global similarity between two strings is a lower
bound on their local similarity. Note also that while it is straightforward to define
local distance measures, using minimization and a scoring scheme, say, like (3),
the alphabet-independent versions of distance and similarity are not related by
Lemma 1. Thus, even though (alphabet-independent) local distance depends on
two parameters, local similarity depends on three. For the alphabet-dependent
case, we still have $(|\Sigma|^2 + |\Sigma|)/2 + 1$ parameters.

Bounds on the Features of Alignments. We will need the following facts about
pairwise alignment, which have been proved elsewhere. As before, $w_A$, $x_A$, $y_A$, and $z_A$
denote the number of matches, mismatches, indels, and gaps in an alignment A.
Let n and m denote the lengths of the input strings, where $n \le m$.

Lemma 2. For any pairwise global or local alignment A, $w_A + x_A \le n$.

Lemma 3. For any global or local alignment A, $y_A, z_A \le m + n$. Moreover, if
A is global, $y_A \ge m - n$.

2.2 Multiple Alignments 

A multiple alignment A of strings $S_1, \ldots, S_k$, where $S_i$ has length $n_i$, is obtained
by inserting spaces in each string to obtain strings of the same length $l$. The result
is a matrix with k rows and $l$ columns, such that each character and space of
each string appears in exactly one column. A induces a pairwise alignment of 
Si and Sj in a natural way: remove all rows of A except those corresponding to 
Si and Sj and strike out any columns containing two spaces. This will be called 
the induced pairwise alignment of Si and Sj. 
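
The construction of an induced pairwise alignment described above can be sketched as follows. This is an illustrative sketch only: the function name `induced_pairwise`, the representation of a multiple alignment as a list of equal-length strings, and the sample data are assumptions, not taken from the paper.

```python
def induced_pairwise(rows, i, j, space="-"):
    """Keep rows i and j of a multiple alignment and strike out all-space columns."""
    kept = [(a, b) for a, b in zip(rows[i], rows[j])
            if not (a == space and b == space)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

# Hypothetical multiple alignment with k = 3 rows and l = 6 columns.
msa = ["AC-GT-",
       "A--GTA",
       "ACCG-A"]
pair_01 = induced_pairwise(msa, 0, 1)  # column 3 holds two spaces and is removed
pair_02 = induced_pairwise(msa, 0, 2)  # no column is removed
```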

The following generalization of two-sequence alignment has been considered
before; it is used in the MSA package for multiple sequence alignment.

Weighted Sum-of-Pairs Alignment (Similarity Version)

Input: A set of sequences $S = \{S_1, \ldots, S_k\}$ and a $k \times k$ matrix $B = [b_{ij}]$.

Question: Find a multiple alignment A for S maximizing $\sum_{1 \le i < j \le k} b_{ij}\,\sigma_A(S_i, S_j)$,
where $\sigma_A(S_i, S_j)$ is the similarity score of the pairwise alignment between
$S_i$ and $S_j$ induced by A.

The scoring scheme can be global or local, alphabet-dependent or -independent.
Distance versions of this problem can be defined in the obvious way, using
minimization instead of maximization and appropriate scoring schemes. The
following is a direct consequence of Lemma 1.

Lemma 4. Under global alphabet-independent scoring, a multiple alignment has
maximum similarity score at $(\alpha, \beta, \gamma)$ if and only if it has minimum distance
score at $(\alpha + 1, \beta + 1/2, \gamma)$.

Thus, for the global alphabet-independent case, the score is a function of only 
two parameters, the indel and gap penalties. For the local alphabet-independent 
case, the score is a function of two parameters for distance measures and three for 
similarity measures (since the mismatch penalty must be considered, in addition 
to the indel and gap penalties). For the alphabet-dependent case, the score is
still a function of $(|\Sigma|^2 + |\Sigma|)/2 + 1$ parameters.

Gaps raise a problematic issue in multiple alignments. A natural approach is
to consider as gaps only those that arise in the induced pairwise alignments.
However, computing optimum alignments in this way would be too time-consuming.
As an alternative, Altschul proposed “quasi-natural” gap penalties, a
scheme later integrated into MSA. We will not elaborate on this issue, except
to state that it will not impact the parametric analysis of Section 4 significantly.

In the next two families of problems, evolutionary history is used and/or 
inferred. We define them for distance measures; similarity versions can be defined 
in the obvious way. Alphabet-dependent or -independent scoring can be used. 

We need some definitions. A phylogeny for a set of sequences S is a tree T
where every internal node has degree at least three, each element of S labels a
distinct leaf of T, and no leaf of T is unlabeled. An internal labeling for T is
an assignment of sequences over $\Sigma \setminus \{-\}$ to the internal nodes of T. The length
of a tree T whose vertices are labeled by sequences is the sum of the pairwise
distances between the labels of adjacent nodes.

Phylogenetic Alignment 

Input: A phylogeny T for a set of sequences S. 

Question: Find an internal labeling for T that minimizes the total length of the 
resulting tree. 

While this problem is NP-complete, it can be solved in time that is
exponential in the number of sequences. An important special case is
star alignment, where the tree T has only one internal node.

Given a solution to the phylogenetic alignment problem, one can derive a
multiple alignment A for the sequences labeling the phylogeny that is consistent
with it in the following sense: The value of the induced pairwise alignment for
any two sequences equals the distance between them. One can obtain a
multiple alignment for S by striking out the rows of A that do not correspond
to elements of S. However, the labels on the internal nodes can be valuable, as 
they represent hypothetical ancestors to the elements of S. 

The next problem, which is MAX SNP-hard, is related to phylogenetic
alignment, but the tree itself is not given.

Generalized Phylogenetic Alignment 

Input: A set of sequences S. 

Question: Find a minimum-length internally-labeled phylogeny for S. 

Phylogenetic alignment and its generalized version have similarity variants 
where the goal is to find a solution that maximizes the similarity score. 

To prevent spurious matches between internal nodes, we make the following 
assumption about internally-labeled phylogenies T: Let A be a multiple align- 
ment consistent with the labels of T. Then, each column in A contains at least 
one character from one of the sequences in S. 

One distinction between phylogenetic and SP alignment is that similarity 
and distance measures are not equivalent in either the local or global cases. 
This equivalence for the SP problem is in part a consequence of knowing in 
advance the lengths of the strings involved in the pairwise comparisons. This is 
not true for phylogeny problems. Thus, for global and local alphabet-independent 
phylogenetic alignment, the score is a function of two parameters for distance 
measures and three for similarity measures. For the alphabet-dependent case,
the score is a function of $(|\Sigma|^2 + |\Sigma|)/2 + 1$ parameters. We can, however, still
establish useful bounds on the number of features.

Lemma 5. For any alignment A of a set S of k sequences of length n to a
phylogeny with r internal nodes, $w_A + x_A \le nkr$ and $z_A \le y_A \le nk(k + 2r - 1)$.

Proof. Let T be the input phylogeny for S. Each unlabeled node in T is assigned
a sequence of length at most nk, since each of its characters must line up with
a character in some sequence in S. By Lemma 2, the total number of matches
and mismatches in the induced pairwise alignment between any two adjacent
sequences in T is at most equal to the length of the shorter sequence. Thus, the
contribution of an edge in T to the total number of matches and mismatches
is at most n if one endpoint is an element of S and at most nk if neither endpoint
is in S. The total number of edges in the latter category is at most r - 1, while the
number of edges in the former category is at most k. This establishes the bound
for $w_A + x_A$, since $kn + (r - 1)nk = nkr$.

To bound $y_A$ and $z_A$, we use Lemma 3, noting that edges where one endpoint
is in S contribute at most n + nk indels, while those where neither endpoint is
in S contribute at most 2nk to the total. □

For star alignments, we have a better bound, whose proof is omitted. 

Lemma 6. Let A be an optimal star alignment under alphabet-independent
distance-based global scoring. Then $y_A \le kn$ and $z_A \le k$.

3 Parametric Analysis

In this section, we consider two issues that arise in parametric analysis: finding 
the number of distinct optimal solutions attained as the parameters are varied 
across their range and generating all of these solutions. We study these questions 
within a framework that encompasses a broad class of problems that include 
various alignment problems. We first need some definitions. 

In an affine parametric combinatorial optimization problem, the cost of each
element x of the set $X \subseteq \mathbb{R}^{d+1}$ of feasible solutions is an affine function $f_x$ of a
parameter vector $\lambda = (\lambda_1, \ldots, \lambda_d)$. The fixed parameter problem is to compute

$$F(\lambda) = \min_{x \in X} f_x(\lambda). \qquad (7)$$

We write “min” in the definition above for concreteness; the concepts and results
to follow have analogs for maximization problems.

Since it is the lower envelope of a set of affine functions, F is piecewise affine.
F induces a partition of $\mathbb{R}^d$ into d-dimensional convex polyhedral optimality
regions, such that $F(\lambda)$ is attained by a single function $f_x$ for all $\lambda$ in the interior
of each such region. This subdivision of $\mathbb{R}^d$ is known as the minimization
diagram of F. Note that, while a single cost function attains the optimal value
for each region, there might be several feasible solutions x with the same cost
function that are co-optimal within the region.
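
For a single parameter (d = 1) the minimization diagram is a sequence of intervals, and it can be computed directly when the feasible set is small enough to list. The sketch below is only meant to illustrate the definition: it assumes costs of the form x + λy with integer x and y, restricts λ to be nonnegative, and uses a simple quadratic-time walk along the envelope. The function name `lower_envelope` and the sample data are illustrative, not from the paper.

```python
from fractions import Fraction

def lower_envelope(solutions):
    """Return [(start, (x, y)), ...]: the pieces of F(lam) = min x + lam*y for lam >= 0.

    Each entry gives the parameter value at which the cost function (x, y)
    becomes optimal; it stays optimal until the start of the next entry.
    """
    pts = set(solutions)
    cur = min(pts)  # smallest intercept, ties broken by smallest slope: optimal near 0
    lam = Fraction(0)
    pieces = [(lam, cur)]
    while True:
        best = None
        for (x, y) in pts:
            if y < cur[1]:  # only flatter lines can overtake as lam grows
                cross = Fraction(x - cur[0], cur[1] - y)
                key = (cross, y)  # earliest crossing; among ties, the flattest line
                if best is None or key < best[0]:
                    best = (key, (x, y))
        if best is None:
            return pieces
        lam, cur = best[0][0], best[1]
        pieces.append((lam, cur))

# Hypothetical feasible set: (4, 2) is never optimal, so there are three regions.
regions = lower_envelope([(0, 3), (2, 1), (5, 0), (4, 2)])
```

In this example the optimality regions are [0, 1], [1, 3], and [3, ∞), attained by the cost functions 3λ, 2 + λ, and 5, respectively.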



3.1 The Number of Optimality Regions 



We now prove upper bounds on the number of optimality regions for parametric
problems of the form (7), where the cost of a feasible solution $x = (x_0, \ldots, x_d) \in X$
is given by $f_x(\lambda) = x_0 + \sum_{i=1}^{d} \lambda_i x_i$. Note that the alignment scoring schemes
described in Section 2 are of this form.

The following result is implicit in the work of Gusfield et al. 



Lemma 7. If $X \subseteq \{0, \ldots, N\}^2$ for some nonnegative integer N, then $F(\lambda)$
induces $O(N^{2/3})$ optimality regions in $\mathbb{R}$.



Proof. We rely on the following fact, which has been shown elsewhere.

(*) Let $\{a_i/b_i\}_{1 \le i \le k}$ be a set of (distinct) irreducible fractions with positive
numerators and denominators such that $\sum_{i=1}^{k} a_i \le N$ and $\sum_{i=1}^{k} b_i \le N$. Then $k = O(N^{2/3})$.

Let us denote a feasible solution x by (x, y) and its cost by $f_x(\lambda) = x + \lambda y$.
$F(\lambda)$ is a non-decreasing piecewise affine function consisting of a sequence of
line segments. Hence, if $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ denote the intercept and slope
of the ith and (i + 1)st segments of F, $x_i < x_{i+1}$ and $y_i > y_{i+1}$.

The $\lambda$-value of the meeting point between the ith and (i + 1)st segments of
F is $\Delta x_i / \Delta y_i$, where $\Delta x_i = x_{i+1} - x_i$ and $\Delta y_i = y_i - y_{i+1}$. Thus,
$\Delta x_1/\Delta y_1 < \Delta x_2/\Delta y_2 < \cdots < \Delta x_s/\Delta y_s$. Since the numerators and denominators of these
fractions are nonnegative integers and $\sum_{i=1}^{s} \Delta x_i, \sum_{i=1}^{s} \Delta y_i \le N$, (*) applies,
implying that F has $O(N^{2/3})$ optimality regions. □

Lemma 8. If X is a subset of $\{0, \ldots, N\}^{d+1}$, where N is a positive integer,
then $F(\lambda)$ induces $O(N^{d - 1/3})$ optimality regions in $\mathbb{R}^d$.

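Proof. By induction on d. The basis, d = 1, follows from Lemma 7. For d > 1,
define $X_j = \{x \in X : x_d = j\}$ and $h_x(\lambda) = x_0 + \sum_{i=1}^{d-1} \lambda_i x_i$. Then, we can
express F as

$$F(\lambda) = \min_{0 \le j \le N} \left( \left( \min_{x \in X_j} h_x(\lambda) \right) + \lambda_d j \right).$$

By hypothesis, $F_j(\lambda) = \min_{x \in X_j} h_x(\lambda)$ induces $O(N^{d - 4/3})$ regions in $\mathbb{R}^{d-1}$. It can be
verified that $g_j(\lambda) = F_j(\lambda) + \lambda_d j$ induces a subdivision of $\mathbb{R}^d$ into $O(N^{d - 4/3})$
cylinders whose boundary lines are parallel to the $\lambda_d$-axis. Since F is the lower
envelope of N + 1 such $g_j$'s, it induces $O(N \cdot N^{d - 4/3}) = O(N^{d - 1/3})$ optimality
regions in $\mathbb{R}^d$. □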

The following observation generalizes an earlier result.

Lemma 9. Suppose that $X \subseteq A_0 \times \cdots \times A_d$, where each $A_i$ is a set of $N_i$ distinct
real values. Then, $F(\lambda)$ induces at most $\prod_{i=0}^{d} N_i / \max_{0 \le i \le d} N_i$ optimality
regions in $\mathbb{R}^d$.

Proof. Assume without loss of generality that $N_0 = \max_{0 \le i \le d} N_i$. Consider any
$x, y \in X$ such that $F(\lambda') = f_x(\lambda')$ and $F(\lambda'') = f_y(\lambda'')$ for some $\lambda', \lambda''$. If
$x_0 < y_0$, then we must have $x_i \ne y_i$ for some $i \in \{1, \ldots, d\}$, for otherwise
$f_x(\lambda) < f_y(\lambda)$ for all $\lambda$. Thus, out of all (d + 1)-tuples $(x_0, \ldots, x_d)$ whose last d
entries are equal, at most one is associated with a feasible solution that is optimal
at some point. Hence, there are at most $\prod_{i=1}^{d} N_i$ functions that are optimal at
some point, which also bounds the number of regions of F. □

3.2 Constructing the Minimization Diagram 

Algorithms for constructing the minimization diagram of parametric problems
have been proposed before, mostly for the one- and two-parameter cases. Here
we sketch an approach that appears to be part of the
folklore^1 but deserves to be more widely known.

An evaluation of F at $\lambda$ consists of finding the solution $x \in X$ such that
$F(\lambda) = f_x(\lambda)$. Evaluating $F(\lambda)$ is equivalent to computing the equation of the
supporting hyperplane of the set $B_F = \{(\lambda_1, \ldots, \lambda_d, z) : z \le F(\lambda)\}$ at the point
$(\lambda_1, \ldots, \lambda_d, F(\lambda))$. This operation has been called a hyperplane probe by Dobkin
et al., who studied the problem of reconstructing a convex object from a
sequence of such probes. Constructing $B_F$ (or, equivalently, F) from repeated
evaluations of F is one instance of this problem.

Theorem 1 (Dobkin et al.). F can be computed with $O(m + dv)$ evaluations,
where m and v are, respectively, the number of optimality regions and
vertices of the minimization diagram.
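
To illustrate how a diagram can be built from evaluations alone, here is a sketch of the classical divide-and-conquer scheme for the one-parameter case. It is not the d-dimensional probing algorithm of Dobkin et al., whose details the text omits. It assumes a black-box `evaluate(lam)` that returns an optimal pair (x, y) for the cost x + lam*y with integer x and y; the names `probe_breakpoints` and `brute_force` and the sample data are illustrative assumptions.

```python
from fractions import Fraction

def probe_breakpoints(evaluate, lo, hi):
    """Find all breakpoints of F on [lo, hi] using only calls to evaluate."""
    breakpoints = []

    def cost(sol, lam):
        return sol[0] + lam * sol[1]

    def rec(left, right):
        # left is optimal at the left end of the interval, right at the right end
        if left[1] == right[1]:
            return  # same cost function: no breakpoint in between
        lam = Fraction(right[0] - left[0], left[1] - right[1])  # where the lines cross
        mid = evaluate(lam)
        if cost(mid, lam) == cost(left, lam):
            breakpoints.append((lam, left, right))  # nothing lies below the crossing
        else:
            rec(left, mid)   # a third function cuts off the corner: recurse on both sides
            rec(mid, right)

    rec(evaluate(lo), evaluate(hi))
    return breakpoints

# Stand-in oracle: brute force over a tiny hypothetical feasible set.
SOLUTIONS = [(0, 3), (2, 1), (5, 0), (4, 2)]
calls = []

def brute_force(lam):
    calls.append(lam)
    return min(SOLUTIONS, key=lambda s: (s[0] + lam * s[1], s[1]))

result = probe_breakpoints(brute_force, 0, 10)
```

On this example the sketch finds the two breakpoints λ = 1 and λ = 3 with five evaluations. In an alignment setting, the brute-force oracle would be replaced by a fixed-parameter alignment algorithm.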



^1 Naoki Katoh, personal communication.

For brevity, we omit the details of the probing algorithm, which is explained
fully elsewhere. The known 1- and 2-parameter algorithms can be viewed
as special cases. The probing process returns successive elements of a set H of
half-spaces in $\mathbb{R}^{d+1}$ whose intersection equals $B_F$. Actually generating F requires
computing this intersection. A good practical algorithm to do so is the
beneath-beyond method, whose run time is $O(s|H|)$, where s is the size of the output.
It is a consequence of the Upper Bound Theorem that, for fixed d, s is bounded by a
polynomial in the number m of optimality regions, which leads to a polynomial
worst-case bound.

4 Parametric Multiple Alignments 

We now present upper bounds on the number of optimality regions for the
multiple alignment problems of Section 2. To a certain degree, the results are
independent of the kind of alignment problem we are dealing with, as long as we
have bounds on the number of distinct values for the coefficients of the objective
functions. The arguments are similar: We first show how the problem falls within
the scope of Lemma 8 or Lemma 9, and then invoke the appropriate bound.

4.1 SP Alignments 

Our first results concern sum-of-pairs (SP) alignments. We assume that the
weight matrix is fixed. We first consider 0-1 weight matrices, a problem we refer
to as 0-1 SP alignment. For the next three results, p denotes the number of
non-zero entries in the weight matrix. Given a multiple alignment A, $w_{ij}$, $x_{ij}$,
$y_{ij}$, and $z_{ij}$ denote the number of matches, mismatches, indels, and gaps in the
induced pairwise alignment for sequences i and j. By Lemmas 2 and 3, each of
these values is at most 2n. Thus,

$$0 \le \sum_{1 \le i < j \le k} b_{ij} w_{ij},\ \sum_{1 \le i < j \le k} b_{ij} x_{ij},\ \sum_{1 \le i < j \le k} b_{ij} y_{ij},\ \sum_{1 \le i < j \le k} b_{ij} z_{ij} \le 2np. \qquad (8)$$



Theorem 2. The number of optimality regions for alphabet-independent
parametric 0-1 SP alignment under global similarity or distance measures is

(a) $O(n^{2/3} p^{2/3})$ if the gap penalty is zero and

(b) $O(n^{5/3} p^{5/3})$ if the gap penalty is variable.

Proof. By Lemma 4, it suffices to consider distance measures. In this case, with
the mismatch penalty normalized to one, the distance value of a multiple
alignment A is

$$\delta_A = \sum_{1 \le i < j \le k} b_{ij} x_{ij} + \beta \sum_{1 \le i < j \le k} b_{ij} y_{ij} + \gamma \sum_{1 \le i < j \le k} b_{ij} z_{ij}. \qquad (9)$$

Now, parts (a) and (b) follow by applying Lemma 8 for d = 1 and d = 2,
respectively, with N = 2np, where the latter is valid by (8). □

Theorem 3. The number of optimality regions for alphabet-independent
parametric 0-1 SP alignment under local similarity measures is:

(a) $O(n^{5/3} p^{5/3})$ if the gap penalty is zero and

(b) $O(n^{8/3} p^{8/3})$ if the gap penalty varies.

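Proof. The total similarity score of a multiple alignment A is given by

$$\sigma_A = \sum_{1 \le i < j \le k} b_{ij} w_{ij} - \alpha \sum_{1 \le i < j \le k} b_{ij} x_{ij} - \beta \sum_{1 \le i < j \le k} b_{ij} y_{ij} - \gamma \sum_{1 \le i < j \le k} b_{ij} z_{ij}. \qquad (10)$$

Now, parts (a) and (b) follow from applying Lemma 8 for d = 2 and d = 3,
respectively, with N = np. □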

Theorem 4. The number of optimality regions for alphabet-dependent para- 
metric global or local 0-1 SP alignment under distance and similarity measures 
is 

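Proof. By equation (5), up to the sign of the gap term (negative for similarity,
positive for distance), the value of an alignment A is:

$$\mathrm{val}_A = \gamma \sum_{1 \le i < j \le k} b_{ij} z_{ij} + \sum_{\{s,t\} \subseteq \Sigma} a(s, t) \sum_{1 \le i < j \le k} b_{ij} x_{ij}(s, t), \qquad (11)$$

where $z_{ij}$ and $x_{ij}(s, t)$ are, respectively, the number of gaps and the number
of times character s is lined up with character t in the pairwise alignment
between strings i and j induced by A. The claim follows from Lemma 9, since
$\sum_{1 \le i < j \le k} b_{ij} x_{ij}(s, t)$ can take on $O(np)$ distinct values. □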

We now consider a two-parameter problem whose pairwise version was stud- 
ied earlier by Gusfield et al. in an attempt to analyze the trade-offs between 
match, mismatch, and indel penalties under alphabet-dependent scoring, when 
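the substitution matrix is fixed. Given a multiple alignment A, define the
following scores of the induced pairwise alignment for sequences i and j:

$$M_{ij}(A) = \sum_{t \in \Sigma \setminus \{-\}} a(t, t)\, x_{ij}(t, t), \qquad MS_{ij}(A) = \sum_{s, t \in \Sigma \setminus \{-\},\ s \ne t} a(s, t)\, x_{ij}(s, t),$$

$$S_{ij}(A) = \sum_{t \in \Sigma \setminus \{-\}} a(t, -)\, x_{ij}(t, -).$$

The total score of A is the sum of the pairwise scores:

$$\sum_{i<j} M_{ij} - \lambda \sum_{i<j} MS_{ij} - \mu \sum_{i<j} S_{ij}. \qquad (12)$$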

We refer to the problem of finding a maximum-score alignment under the above
scoring scheme as the SP trade-off problem. Gusfield et al. proved a
subexponential bound on the number of optimality regions encountered in traversing
the $(\lambda, \mu)$-plane along any line. For the case where the entries of the substitution
matrix are small integers, as is often true for PAM and BLOSUM matrices used
in practice, we can prove a better bound.

Theorem 5. Suppose that for $s, t \in \Sigma$, $a(s, t) \in \mathbb{Z}$ and $|a(s, t)| \le U$, $U \in \mathbb{Z}$.
Then, the total number of optimality regions induced by the SP trade-off problem
on the $(\lambda, \mu)$ plane is $O(n^{5/3} U^{5/3} k^{10/3})$.

Proof. Follows from Lemma 8 with d = 2, since $\sum_{i<j} M_{ij}$, $\sum_{i<j} MS_{ij}$, and
$\sum_{i<j} S_{ij}$ are integers of magnitude $O(nUk^2)$. □

Finally, consider the situation where the weights are arbitrary real values. 

Theorem 6. The number of optimality regions for global or local parametric 
SP alignment is ^ for the alphabet-independent case and ^ for the 

alphabet- dependent case under similarity and distance measures. 

Proof. We consider only the alphabet-dependent case; the other cases are similar.
The cost of a feasible solution is given by (11). The claim now follows from
Lemma 9, since the number of distinct values that $\sum_{i<j} b_{ij} x_{ij}(s, t)$ can take on
is bounded and there are $O(|\Sigma|^2)$ parameters. □

4.2 Phylogenetic Alignments 

Abusing terminology, we shall call a feasible solution to the phylogenetic align- 
ment problem a phylogenetic alignment (or, simply, an alignment). An alignment 
will be viewed as consisting of both an internal labeling for the input phylogeny 
T and, for each edge of T, a pairwise alignment between the sequences labeling its
endpoints. For the generalized case, in addition to the above, a feasible solution 
(also called an alignment) will also consist of a phylogeny. 

Theorem 7. The number of optimality regions for parametric phylogenetic and
generalized phylogenetic alignment under alphabet-independent scoring is

(a) $O(n^{2/3} k^{4/3})$ under the distance measure if the gap penalty is zero,

(b) $O(n^{5/3} k^{10/3})$ under the distance measure if the gap penalty is allowed to
vary,

(c) $O(n^{5/3} k^{10/3})$ under the similarity measure if the gap penalty is held at zero,
and

(d) $O(n^{8/3} k^{16/3})$ under the similarity measure if the gap penalty is allowed to
vary.

Proof. By Lemma 5 and the fact that the number of internal nodes of any
phylogeny for S is at most k, $0 \le x_A, y_A, z_A \le N = O(nk^2)$. Under distance
measures, the total score of an alignment is $x_A + \beta y_A + \gamma z_A$. Now, (a) and (b)
follow from Lemma 8 with d = 1 and d = 2, respectively. Under similarity
measures, the score is $w_A - \alpha x_A - \beta y_A - \gamma z_A$. Thus, parts (c) and (d) follow
from Lemma 8 with d = 2 and d = 3, respectively. □

Theorem 8. The number of optimality regions for star alignment under global 
alphabet independent scoring is when the gap penalty is fixed and 

when the gap penalty varies. 

Proof. Use Lemmas HandJ with d = 2 and d = 3, respectively. □ 
Theorem 9. The number of optimality regions for alphabet-dependent
parametric phylogenetic and generalized phylogenetic alignment
under distance and similarity measures is

Proof. By equation (5), the value of an alignment A is:

$$\mathrm{val}_A = \gamma \sum_{(u,v) \in T} z_{uv} + \sum_{\{s,t\} \subseteq \Sigma} a(s, t) \sum_{(u,v) \in T} x_{uv}(s, t), \qquad (13)$$

where T is the phylogeny, $z_{uv}$ and $x_{uv}(s, t)$ are, respectively, the number of gaps
and the number of times character s is lined up with character t in the pairwise
alignment between the strings labeling nodes u and v of T. The value of A is
thus a function of $O(|\Sigma|^2)$ parameters. Moreover, $\sum_{(u,v) \in T} x_{uv}(s, t)$ can take on $O(nk^2)$
distinct values. The claim now follows from Lemma 9. □

5 Discussion 

Since the number of optimality regions for all problems considered here is
polynomial in the length and number of sequences (assuming bounded alphabet in the
alphabet-dependent case), Theorem 1 implies that the corresponding minimization
diagrams can be computed with a polynomial number of calls to algorithms
for the respective fixed-parameter problems. However, the fact that an exact
solution to these problems is needed is a big limitation, since the cost of carrying
out even a single multiple or phylogenetic alignment is prohibitive, except for 
short sequences. 

In practice, the fixed-parameter problems are often solved heuristically, and 
each such scheme raises its own parameter-sensitivity issues. The approach pre- 
sented here can be used to analyze any procedure that minimizes or maximizes 
a function that has a discrete dependence on the features of pairwise alignments. 
Examples of such approaches can be found in the literature. Different techniques
seem necessary to analyze heuristics that do not fall in this category, e.g.,
progressive alignment and some other published methods.

Abstract. In this paper we present a branch and bound algorithm for
local gapless multiple sequence alignment (motif alignment) and its
implementation. This is the first program to exploit the fact that the motif
alignment problem is easier for short motifs. Indeed for a fixed motif
width the running time of the algorithm is asymptotically linear in the
size of the input. We tested the performance of the program on a dataset
of 300 E. coli promoter sequences. For a motif width of 4 the optimal
alignment of the entire set of sequences can be found. For the more nat- 
ural motif width of 6 the program can align 19 sequences of length 100; 
more than twice the number of sequences which can be aligned by the 
best previous exact algorithm. The algorithm can relax the constraint of 
requiring each sequence to be aligned, and align 100 of the 300 promoter 
sequences with a motif width of 6. We also compare the effectiveness 
of the Gibbs sampling and beam search heuristics on this problem and 
show that in some cases our branch and bound algorithm can find the 
optimal solution, with proof of optimality, when those heuristics fail to 
find the optimal solution. 



1 Introduction 

The function of DNA and protein sequences can often be characterized by the 
presence of important substrings or “motifs”, in the case of DNA sequences 
often corresponding to protein binding sites. This observation has led to the 
field of local gapless multiple sequence alignment (hereafter motif alignment) 
which attempts to find meaningful substrings from a collection of sequences 
which have a common biological function, typically by learning a function which 
scores substrings based on how “good” they are as instances of the motif in 
question. Consensus based methods score commonly appearing substrings higher 
than other substrings, sometimes allowing a small number of mismatches. These 
methods have the advantage that it is easy to devise exact algorithms for small 
motif lengths; recent work follows this approach. However, for poorly
conserved motifs the number of common substrings grows exponentially with 
the motif length, and since the available data is always limited it is problematic 
to accurately estimate the score of each individual substring. Thus it is common 
to use matrix based or “profile” methods which decompose the scoring function 
into a sum over each position in the substring to be scored. Unfortunately, for
these matrix based methods there have been no known algorithms which do not
depend exponentially on the number of sequences n. Indeed, the specific problem
formulation adopted in this work has been found to be NP-hard for
a general motif width. Furthermore, the problem has been shown to be APX-hard,
which implies that, if P ≠ NP, a polynomial time approximation scheme for
this problem does not exist. The problem is also NP-hard when the constraint
of finding an optimal discrete alignment is relaxed to that of finding an optimal
“fuzzy” alignment. Thus most work in this field has concentrated on finding
good heuristic algorithms. Beam search and Gibbs sampling have been
used for the specific problem formulation that this paper adopts, and expectation
maximization (EM) has been used for a related problem. In earlier work, an
exact branch and bound algorithm has been developed which can solve some
problem instances with more than 10 DNA sequences. However, the worst
case running time of the algorithm is still exponential in n and can be very long
in practice.

In this paper we describe a new branch and bound algorithm which is the 
first matrix based exact method to actively exploit the fact that the alignment 
problem is easier for smaller motif widths. Indeed for a fixed width motif the 
running time of the algorithm has an upper bound, which although quite large, 
is only linear in the number of sequences to be aligned. We refer to our new 
algorithm as the “Tsukuba BB Algorithm” and the older branch and bound
algorithm as the “Berkeley BB Algorithm”.

This paper is organized as follows: we define a specific optimization problem, 
then describe a search tree for finding optimal alignments and introduce a score 
based bound for nodes in that search tree. Following that, we introduce a tech- 
nique for avoiding redundant search of equivalent paths of the search tree and 
prove its correctness. We then define the concept of a consistent alignment and 
show how that concept leads to a complementary method for pruning branches 
in the search tree. After that we describe our implementation of the algorithm 
and give empirical results comparing its performance to a previous branch and 
bound algorithm, Gibbs sampling, and beam search, on a dataset of E. coli
promoter sequences. Finally, a generalization of the original optimization problem
in which some sequences are left out of the alignment is introduced and its use 
is demonstrated on the promoter dataset. We close with a short discussion of 
the results presented. 

2 Problem Definition 

We adopt the following problem formulation: to choose one substring of length 
w from each sequence in a set of sequences such that the score of the chosen 
substrings is maximal. For a scoring function we use the common maximum
likelihood ratio score.

2.1 Maximum Likelihood Ratio Score

In this section we formally define the maximum likelihood ratio (MLR) score
of an alignment. For our purposes an alignment A is a collection of length w
substrings from the input sequences, where $A_j$ is the substring taken from the
jth input sequence. First we define a special case known as the entropy score.
In words it is simply -1 times the sum of the information theoretic entropy
of the distribution of characters in each column in the given alignment. Let $\sigma$
denote the size of the alphabet, n be the number of input sequences, and $F(c, i)$
the frequency (count) of character c in column i of the alignment. (The matrix F/n is
sometimes called a profile.) We slightly abuse notation by letting $\sigma$ represent
the set of characters in the alphabet in the phrase $c \in \sigma$. The entropy score can
then be written as:

$$\sum_{i=1}^{w} \sum_{c \in \sigma} \frac{F(c, i)}{n} \log \frac{F(c, i)}{n}.$$
The MLR Score adds a probability vector B of length $\sigma$ as a background
model. Also it is common to add a vector of “pseudocounts” $P(1 \ldots \sigma)$. The
MLR Score, R(A), of an alignment A is defined as:

$$R(A) = \sum_{i=1}^{w} \sum_{c \in \sigma} \frac{F'(c, i)}{n'} \left( \log \frac{F'(c, i)}{n'} - \log B(c) \right),$$

where $n' = n + \sum_{c \in \sigma} P(c)$ and $F'(c, i) = F(c, i) + P(c)$.

The MLR score is equivalent to the entropy score when a uniform distribution
is used for the background model and all pseudocounts are set to zero.
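
As a concrete illustration of the definition above, the MLR score of a small alignment can be computed column by column. This is a sketch under stated assumptions rather than the authors' implementation: natural logarithms are used (the text does not fix the base), terms with a zero count are treated as contributing zero, and the function name `mlr_score` and the dictionary-based representation of B and P are hypothetical.

```python
import math
from collections import Counter

def mlr_score(substrings, alphabet, background, pseudocounts):
    """MLR score R(A) of equal-length substrings, summed over columns and characters."""
    n = len(substrings)
    w = len(substrings[0])
    n_prime = n + sum(pseudocounts[c] for c in alphabet)
    score = 0.0
    for i in range(w):
        counts = Counter(s[i] for s in substrings)  # F(c, i) for column i
        for c in alphabet:
            f_prime = counts[c] + pseudocounts[c]   # F'(c, i) = F(c, i) + P(c)
            if f_prime > 0:                         # convention: 0 * log 0 = 0
                p = f_prime / n_prime
                score += p * (math.log(p) - math.log(background[c]))
    return score

alphabet = "ACGT"
uniform = {c: 0.25 for c in alphabet}
no_prior = {c: 0 for c in alphabet}
laplace = {c: 1 for c in alphabet}   # the add-one prior

conserved = mlr_score(["ACGT", "ACGT"], alphabet, uniform, no_prior)
mixed = mlr_score(["AC", "AG"], alphabet, uniform, no_prior)
smoothed = mlr_score(["AC", "AG"], alphabet, uniform, laplace)
```

With a uniform background and no pseudocounts, each perfectly conserved column contributes log 4, so the score differs from the entropy score only by the constant w log σ.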

The standard definition of the MLR score, given above, sums over columns in
an alignment and characters in the alphabet. However it will suit our purposes
to rewrite this score as an equivalent formula that sums over substrings included
in the alignment. To describe that score we use $A_{ji}$ to denote the ith character
in the substring $A_j$, i.e. $A_j = A_{j1} \ldots A_{jw}$. We define the score of a substring
$S = s_1 \ldots s_w$ relative to a frequency matrix F' as:

$$\varphi_{F'}(S) = \frac{1}{n'} \sum_{i=1}^{w} \left( \log \frac{F'(s_i, i)}{n'} - \log B(s_i) \right).$$

The MLR score can now be written as:

$$R(A) = \sum_{j=1}^{n} \varphi_{F'}(A_j) + \sum_{i=1}^{w} \sum_{c \in \sigma} \frac{P(c)}{n'} \left( \log \frac{F'(c, i)}{n'} - \log B(c) \right).$$

Note that the second term does not include an $A_j$ term. Therefore we can define
a prior included score of a substring S as:

$$\hat{\varphi}_{F'}(S) = \varphi_{F'}(S) + \frac{1}{n} \sum_{i=1}^{w} \sum_{c \in \sigma} \frac{P(c)}{n'} \left( \log \frac{F'(c, i)}{n'} - \log B(c) \right)$$

and then rewrite that MLR score as the sum of the score of each substring in an
alignment.

$$R(A) = \sum_{j=1}^{n} \hat{\varphi}_{F'}(A_j).$$

Choice of Background Probabilities and Priors. The choice of background 
model will depend on the intended use. In this paper we adopt two common 
choices: a uniform background model and a model based on the frequencies of 
bases in the input. For priors, we used the add-one Laplace prior, i.e. we set all
the elements of P to one.

3 Condensed Search Tree 

The search tree used by the Tsukuba BB algorithm is constructed from a root 
node and input set of sequences as follows: for any substring s of length w found 
in the input sequences, construct a branch from the root to a new node labeled s, 
then recursively construct a search tree from s using s as the root node and the 
same set of input sequences except for the removal of all sequences containing 
s. An example of a search tree is shown in figure 1. Note that this search tree
is condensed in the sense that a single edge can represent multiple occurrences 
of a substring. Thus some alignments cannot be expressed as paths through this 
tree. However we prove that a path to an optimal alignment exists in this tree. 
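
The recursive construction described above can be sketched as a generator that enumerates every root-to-leaf path of the condensed search tree. This is a minimal illustration, not the authors' data structure: the names `substrings` and `condensed_paths` are hypothetical, every sequence is assumed to have length at least w, and children are visited in lexicographic order purely for reproducibility.

```python
def substrings(seq, w):
    """Set of distinct length-w substrings of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def condensed_paths(seqs, w, path=()):
    """Yield each root-to-leaf path of the condensed search tree.

    A path is a tuple of substrings; choosing substring s aligns every
    remaining sequence that contains s, so those sequences are removed
    before recursing.
    """
    if not seqs:
        yield path
        return
    candidates = sorted(set().union(*(substrings(s, w) for s in seqs)))
    for s in candidates:
        remaining = [q for q in seqs if s not in q]   # drop sequences containing s
        yield from condensed_paths(remaining, w, path + (s,))
```

For the two sequences "atg" and "tgc" with w = 2, the single edge "tg" aligns both sequences at once, which is exactly the condensation described in the text.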

Theorem 1 A path to any optimal alignment exists in the condensed search 
tree. 

We use some properties of an optimal alignment $A^*$ to prove this theorem. We
denote the MLR score obtained by this alignment as $R^*$ and the prior included
frequency matrix of the alignment by $F^{*\prime}$. We define the optimal model score
$M(A)$ of an alignment as the prior included score of the substring relative to
$F^{*\prime}$ summed over the substrings in the alignment, i.e.

$$M(A) = \sum_{j=1}^{n} t'_{F^{*\prime}}(A_j)$$

(One interpretation of this is that it is the log likelihood ratio of A being gener- 
ated by the model F*' versus the background model B.) We use this definition 
to prove two lemmas. 

Lemma 1. There exists a path in the condensed tree which corresponds to an 
alignment whose optimal model score is at least R* . 

Proof: The proof is by construction. Consider the distinct length w substrings 
contained in the input sequences. Order all the substrings by their prior included 
score relative to F*' . Align all occurrences of the highest scoring substring and 
remove the aligned sequences from consideration. Repeat this procedure with 
the remaining sequences until a full alignment is obtained. This procedure gives
the alignment with the maximum optimal model score, which must be at least
R* since A* is a feasible alignment whose optimal model score is R*.
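
The construction in this proof can be sketched as follows. The sketch assumes that the prior included score relative to F*' is available as a function `score` (in practice F*' is unknown until the problem is solved, so this only illustrates the argument). The name `greedy_alignment` and the tie-breaking rule are illustrative assumptions.

```python
def greedy_alignment(seqs, w, score):
    """Build the alignment used in the proof of Lemma 1.

    score: hypothetical function mapping a length-w substring to its prior
           included score relative to F*'.
    Returns (path, alignment): the sequence of substrings chosen (a path in
    the condensed tree) and the substring assigned to each input sequence.
    """
    alignment = [None] * len(seqs)
    path = []
    unaligned = set(range(len(seqs)))
    while unaligned:
        candidates = {seqs[j][i:i + w]
                      for j in unaligned
                      for i in range(len(seqs[j]) - w + 1)}
        # Highest scoring substring first; ties broken lexicographically.
        best = max(candidates, key=lambda s: (score(s), s))
        path.append(best)
        for j in [j for j in unaligned if best in seqs[j]]:
            alignment[j] = best            # align all occurrences at once
            unaligned.discard(j)
    return path, alignment
```

Because each chosen substring is the best available for every sequence it aligns, the resulting path visits substrings in non-increasing score order.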

Lemma 2. The optimal model score of an alignment is never greater than its 
MLR score 

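Proof: The optimal model score of an alignment $A$ with a prior included
frequency matrix $F'$ can be written as:

$$M(A) = \sum_{j=1}^{n} t'_{F^{*\prime}}(A_j) = \sum_{i=1}^{w} \sum_{c \in \sigma} \frac{F'(c, i)}{n'} \left( \log \frac{F^{*\prime}(c, i)}{n'} - \log B(c) \right) \qquad (1)$$

The MLR score of $A$ is:

$$R(A) = \sum_{i=1}^{w} \sum_{c \in \sigma} \frac{F'(c, i)}{n'} \left( \log \frac{F'(c, i)}{n'} - \log B(c) \right) \qquad (2)$$

Here we state a well known fact, which can be proved with the method of
Lagrange multipliers. For any given fixed probability vector $u$ of length $\sigma$,
the maximum of the sum $\sum_i u_i \log v_i$, maximized over any length $\sigma$
probability vector $v$, occurs when $v = u$, i.e.

$$\max_{v} \sum_{i=1}^{\sigma} u_i \log v_i = \sum_{i=1}^{\sigma} u_i \log u_i \qquad (3)$$

Combining equations 1, 2, and 3 gives:

$$M(A) \le R(A) \le R^*$$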
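
For completeness, the fact stated in equation (3) can be checked with a short Lagrange multiplier calculation. The sketch below assumes all $u_i > 0$ and natural logarithms (another base only rescales the objective); the multiplier $\lambda$ and the Lagrangian $\mathcal{L}$ are notation introduced here, not taken from the original text.

```latex
\begin{align*}
\mathcal{L}(v, \lambda) &= \sum_{i=1}^{\sigma} u_i \log v_i
    - \lambda \Big( \sum_{i=1}^{\sigma} v_i - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial v_i} &= \frac{u_i}{v_i} - \lambda = 0
    \quad\Longrightarrow\quad v_i = \frac{u_i}{\lambda} \\
\sum_{i=1}^{\sigma} v_i = 1 &\quad\Longrightarrow\quad
    \lambda = \sum_{i=1}^{\sigma} u_i = 1
    \quad\Longrightarrow\quad v_i = u_i
\end{align*}
```

Since the objective is concave in $v$, this stationary point is the maximum. Applying it column by column with $u = F'(\cdot, i)/n'$ and $v = F^{*\prime}(\cdot, i)/n'$ shows that each column of (1) is at most the corresponding column of (2).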



Thus any alignment which has an optimal model score of at least R* must in fact
have an optimal model score of exactly R* and be an optimal alignment. The
alignment guaranteed to exist by Lemma 1 is indeed an optimal alignment, which
proves the theorem.

By condensing nodes, the size of the search tree becomes
bounded by a function of the alphabet and the motif width. As a substring is
never repeated in a path in the search tree, each leaf of the search tree represents
an ordering of some of the length w substrings. Since there are only $\sigma^w$ possible
substrings of length w, the size of the tree is $O((\sigma^w)!)$. Although this is a large
number, note that it is independent of the number of sequences n. Thus in the
limit as w is fixed and n goes to infinity, any algorithm which efficiently searches
the condensed search tree, e.g. depth first search, requires only constant time
plus the time needed to read the input, which is linear in n. Of course the size
of the search tree is also bounded by some functions of n. Namely it is $O(d^n)$,
where d is the number of distinct substrings of length w occurring in the input
sequences; d itself is bounded by the number of bases in the input as well as by
$\sigma^w$.


4 Tsukuba BB Algorithm 

The Tsukuba BB algorithm is simply depth first search on the condensed search
tree with pruning. The pruning criteria guarantee that a path through the tree
which builds an optimal alignment by adding substrings in the best first scoring
order, i.e. with non-increasing values of $t'_{F^{*\prime}}$, will never be pruned. Note that the
construction of Lemma 1 in the previous section guarantees that such a path to
an optimal alignment exists.

Fig. 1. A diagram of the search tree for three sequences with a motif width
of two is shown. Multiple occurrences of a substring represented by a node are
indicated by the number in parenthesis, as in “tg(2)”.

4.1 Pruning Criteria 

Let L be a lower bound on R*, for example the MLR score of some feasible
solution. Let a subalignment be defined as an alignment of some of the input
sequences. Let $S_q$ be the set of substrings contained in a subalignment $A^{(q,m)}$ of
m < n sequences at some node q of the search tree. Let an extension $A^{(q,n)}$ from q be
any alignment of n substrings produced by adding substrings from $S_q$ to
$A^{(q,m)}$. Note that while $A^{(q,m)}$ is a feasible subalignment of the input sequences,
an extension $A^{(q,n)}$ is not a feasible alignment of the input sequences. Thus
extensions are not of direct interest. However the following pruning criteria uses
extensions to advantage.

Pruning Criteria: Prune a node q if any extension $A^{(q,n)}$ can be found such
that:

$$R(A^{(q,n)}) < L$$



4.2 Correctness of the Pruning Criteria 

We must show that no node which represents a subalignment of an optimal 
alignment and built in best scoring first order will ever be pruned. 

Theorem 2 Let A* be an optimal alignment having a prior included frequency
matrix F*'. Let q be a node in the search tree representing a subalignment of
A* built by adding substrings in order of non-increasing $t'_{F^{*\prime}}$ values. For any
extension $A^{(q,n)}$ of q

$$R(A^{(q,n)}) \ge R(A^*).$$

Proof: From Lemma 2 of the previous section we know that the MLR score of
$A^{(q,n)}$ is at least as great as its optimal model score, $M(A^{(q,n)})$. Thus
it is sufficient to show that $M(A^{(q,n)}) \ge R(A^*)$. Note that:

$$M(A^{(q,n)}) = \sum_{j=1}^{m} t'_{F^{*\prime}}(A^{(q,n)}_j) + \sum_{j=m+1}^{n} t'_{F^{*\prime}}(A^{(q,n)}_j)$$

where we have divided the sum into two sums by numbering the sequences so
that the first m sequences are the ones included in the subalignment represented
by q.

Likewise R(A*) can be expanded as:

$$R(A^*) = \sum_{j=1}^{m} t'_{F^{*\prime}}(A^*_j) + \sum_{j=m+1}^{n} t'_{F^{*\prime}}(A^*_j)$$

The first summation term for $M(A^{(q,n)})$ and R(A*) is the same by the
requirement that the subalignment represented by q is a subalignment of A*. Thus
the difference is:

$$M(A^{(q,n)}) - R(A^*) = \sum_{j=m+1}^{n} \left( t'_{F^{*\prime}}(A^{(q,n)}_j) - t'_{F^{*\prime}}(A^*_j) \right)$$

Since the substrings added at q are taken from the substrings in A* in order
of non-increasing $t'_{F^{*\prime}}$ values, each term of this summation is at least zero. Thus:

$$R(A^{(q,n)}) \ge M(A^{(q,n)}) \ge R(A^*)$$

When priors are not used a more powerful pruning criterion can be used:

Pruning Criteria (No Priors): Prune a node q, representing a subalignment
with m total substrings, if any extension $A^{(q,i)}$, $m < i \le n$, can be found such
that:

$$R(A^{(q,i)}) < L$$

We omit the proof of correctness, which is similar to the proof of the pruning 
criteria used with priors given in the previous section. 

Choosing Extensions. As stated above, the score of any extension, $R(A^{(q,n)})$,
is an upper bound on the score of alignments which may be obtained by expanding
node q. We use a greedy strategy to try to find an extension which gives a
low upper bound. We consider all extensions obtained by adding one substring
from $S_q$ and pick the lowest scoring one, i.e. we pick an extension that minimizes
$R(A^{(q,m+1)})$. We then consider all possible ways to add one substring to that
extension to produce an extension with m + 2 total substrings, and so on.
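
The greedy choice of extensions can be sketched as below. This is an illustrative reading of the paragraph above, not the authors' code: the helper `mlr` repeats the MLR score definition (with log base 2 assumed), and the names `greedy_extension_bound` and `should_prune` are hypothetical. The sketch extends all the way to n substrings, which corresponds to the criterion used with priors.

```python
import math
from collections import Counter


def mlr(alignment, background, pseudo):
    """MLR score of a list of length-w substrings (log base 2 assumed)."""
    n_prime = len(alignment) + sum(pseudo.values())
    total = 0.0
    for i in range(len(alignment[0])):
        counts = Counter(sub[i] for sub in alignment)
        for c in background:
            f = counts.get(c, 0) + pseudo[c]
            if f:
                p = f / n_prime
                total += p * (math.log2(p) - math.log2(background[c]))
    return total


def greedy_extension_bound(subalignment, n, background, pseudo):
    """Extend the subalignment to n substrings, always adding the member of
    S_q that gives the lowest score, and return the final score."""
    s_q = sorted(set(subalignment))           # S_q: distinct substrings at node q
    ext = list(subalignment)
    while len(ext) < n:
        ext.append(min(s_q, key=lambda s: mlr(ext + [s], background, pseudo)))
    return mlr(ext, background, pseudo)


def should_prune(subalignment, n, lower_bound, background, pseudo):
    """Prune node q if the greedy extension scores below the lower bound L."""
    return greedy_extension_bound(subalignment, n, background, pseudo) < lower_bound
```

The greedy search does not necessarily find the lowest scoring extension; any extension that falls below L is enough to justify pruning.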

5 Using Canonical Representative Nodes to Reduce 
Redundant Work 

General Idea. The condensed search tree shown in figure 1 includes some
redundancy. This redundancy comes from adding the same set of distinct substrings
in different orders. For example node (“at”, “ca”) is equivalent (represents the
same subalignment) to node (“ca”, “at”). Note that this is not always the case
for two nodes whose subalignments contain the same distinct substrings. For
example node (“gc”, “tg”) is not equivalent to node (“tg”, “gc”). This is because
there is a sequence in the input which contains both “gc” and “tg” and therefore
the alignment of that sequence differs for the two nodes. Still, in the worst case
many groups of up to n! equivalent nodes may be found in the tree. We have
developed an algorithm which effectively reduces redundant calculations without
significantly increasing memory requirements.

Even if a node survives the pruning criteria of the previous section it would 
be correct for us to prune that node if we could guarantee that we would not 
prune some other equivalent node. The general idea of our algorithm is to try to 
expand only a single canonical representative node from any group of equivalent 
nodes. In the following section we describe the algorithm we implemented which 
largely achieves this goal. 

Generating Canonical Nodes. We assume that an ordering has been assigned
to the set of possible substrings of length w. Before describing the algorithm we
introduce some additional notation: let $I$ be the set of input sequences and
$I - \{U_1, \ldots, U_p\}$ denote the input sequences that do not contain any substrings
from the set $\{U_1, \ldots, U_p\}$. The algorithm for computing canonical nodes, shown
in figure 2, takes as input a node $U$ and either outputs $U$ itself or an equivalent
node with a longer tail of ascending substrings, where the tail is defined as follows.

Definition 1 The tail of a node $U = U_1 \cdots U_k$ is a series $U_p \cdots U_k$, where $p$ is
the smallest integer such that:

$$\forall i,\; p \le i < k:\quad U_i < U_{i+1}$$

The purpose of this section is to show that: 

Theorem 3 The algorithm shown in Figure 2 either outputs U or a node which
is equivalent to U but has a longer tail than U.

Proof: The program exits from one of four return statements. The first two
cannot violate the terms of the theorem because they return U unchanged. The
third returns V where

$$U = U_1 \cdots U_r \cdots U_i\, U_{i+1} \cdots U_k\, U_{k+1}$$
$$V = U_1 \cdots U_r \cdots U_i\, U_{k+1}\, U_{i+1} \cdots U_k$$

Note that the only difference between U and V is the placement of $U_{k+1}$.
The For loop condition ensures that no substring from $\{U_{i+1}, \ldots, U_k\}$ occurs
together with $U_{k+1}$ in any sequence in $I - \{U_1, \ldots, U_r\}$. Thus U and V are
equivalent. The fourth returns V where

$$U = U_1 \cdots U_r\, U_{r+1} \cdots U_k\, U_{k+1}$$
$$V = U_1 \cdots U_r\, U_{k+1}\, U_{r+1} \cdots U_k$$

In this case also, U and V are equivalent for the same reason. It remains
to prove that V returned by the third or fourth return statement has a longer
tail than U. If the algorithm executes past the first If statement the tail of U
must be just $U_{k+1}$. However if the algorithm returns from the third or fourth
return statements it returns a node V whose tail includes $U_k$ and $U_{k-1}$ (and
possibly other nodes as well). This proves the theorem. We close this section
by noting that in practice the second return statement, returning U, is called
infrequently. This is because, as will be seen in the next section, the parent node
of U, $U_1 \cdots U_k$, itself is a node returned from Canonical and therefore $U_{k-1}$
tends to be less than $U_k$.

    Canonical( U )  // Output the canonical representative node of U = U_1 ... U_{k+1}
        If U_{k+1} > U_k  Return( U )  // #1.
        If U_{k-1} > U_k  Return( U )  // #2.
        Assign π such that U_π is the last substring in U, other than U_k, which is
            greater than its successor.  // The series U_{π+1} ... U_k is in ascending order.
        If U_{k+1} < U_{π+1}
            r ← π
        Else
            Assign r such that:
                U_r is the last substring in U_{π+1}, ..., U_k which is smaller than U_{k+1}
        For( i = k; i > r; i = i − 1 )  // Note that i is decremented.
            If any sequence in I − {U_1, ..., U_r} contains both U_i and U_{k+1}
                Let V be the same as U except with U_{k+1} moved directly after U_i
                Return( V )  // #3. Note that if i = k here, V = U.
        Let V be the same as U except with U_{k+1} moved directly after U_r
        Return( V )  // #4. The series V_{r+1} ... V_{k+1} is in ascending order.

Fig. 2. Pseudocode for computing the canonical representative of a node
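
The pseudocode of Fig. 2 can be transcribed into runnable form roughly as follows. This is a sketch under stated assumptions rather than the authors' implementation: substrings are compared lexicographically (the text only requires some fixed ordering), a node is a Python list of distinct substrings, π is taken to be 0 when no earlier substring exceeds its successor, and return #2 is skipped when $U_{k-1}$ does not exist.

```python
def canonical(U, seqs):
    """Return the canonical representative of node U = U_1 ... U_{k+1}.

    U:    list of distinct length-w substrings (the path from the root).
    seqs: the input sequences I.
    """
    k = len(U) - 1                       # U_{k+1} is the newly added substring
    if k == 0:
        return list(U)
    u = [None] + list(U)                 # 1-based indexing, as in Fig. 2
    if u[k + 1] > u[k]:
        return list(U)                   # return #1
    if k >= 2 and u[k - 1] > u[k]:
        return list(U)                   # return #2
    # pi: last position before k whose substring exceeds its successor (0 if none)
    pi = max((p for p in range(1, k) if u[p] > u[p + 1]), default=0)
    if u[k + 1] < u[pi + 1]:
        r = pi
    else:
        r = max(p for p in range(pi + 1, k + 1) if u[p] < u[k + 1])
    # Sequences in I - {U_1, ..., U_r}: those containing none of U_1 ... U_r
    free = [s for s in seqs if not any(x in s for x in u[1:r + 1])]
    for i in range(k, r, -1):            # i is decremented
        if any(u[i] in s and u[k + 1] in s for s in free):
            return u[1:i + 1] + [u[k + 1]] + u[i + 1:k + 1]   # return #3
    return u[1:r + 1] + [u[k + 1]] + u[r + 1:k + 1]           # return #4
```

For example, with input sequences "atg" and "cag", the node ("ca", "at") is mapped to the equivalent node ("at", "ca"), whose tail is longer.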

The Tsukuba BB Algorithm Using Canonical Nodes. Here we define the
canonical node V of a given node U as the node returned from the algorithm
shown in figure 2. Note that theorem 3 ensures that either V = U or the tail of
V is longer than the tail of U. Thus at least one node (any one with a maximal
length tail) in a group of equivalent nodes must be its own canonical node.

The steps taken to decide whether to expand a node U, whose canonical node
is V, during the depth first search of the Tsukuba BB algorithm are:

1. If U can be pruned by the score based pruning criteria, don’t expand.

2. Otherwise, if V is the same as U, expand.

3. Otherwise, if all of the ancestors of V in the search tree survive the score
based criteria, don’t expand.

4. Otherwise, expand.

Note that for step 3 common ancestors of both U and V do not need to be
checked (we know the ancestors of U pass the pruning criteria). This algorithm
guarantees that if any node U is reached that passes the score based pruning
criteria, either U will be expanded, or an equivalent node with a longer tail will
be expanded.

6 Pruning Inconsistent Alignments 

In this section we describe a pruning strategy which complements the score
based bound. The key idea is that inconsistent subalignments can be pruned.
An example of an inconsistent alignment is shown in figure 3. This alignment
is inconsistent in the sense that it favors ’at’ over ’ac’ in the first sequence, but
the opposite is true in the second sequence. We believe optimal alignments for
reasonable scoring functions will not behave in this way. In particular for the MLR
scoring function used in this work, optimal alignments can be produced by
independently optimizing the score of the substring picked in each sequence relative
to the frequency matrix F*'. Of course we do not know F*' before we solve the
problem. However, if we assume that a subalignment is included in an optimal
alignment, that subalignment gives us some information about F*', which in
turn restricts the number of ways in which sequences can be added to that
subalignment while maintaining consistency. For example if a subalignment aligns
’at’ in a sequence containing ’ac’ we can conclude that $F^{*\prime}(t, 2) \ge F^{*\prime}(c, 2)$, and
therefore, for any unaligned sequence which contains a pair (’xt’,’xc’), where ’x’
is any base, the score of ’xc’ cannot exceed the score of ’xt’. We describe this
relationship between ’at’ and ’ac’ as ’at’ dominates ’ac’.

We exploit this by keeping a $\sigma^w \times \sigma^w$ binary matrix D_table, which allows
pruning based on inconsistency. This type of pruning is easily added to the
Tsukuba BB algorithm described so far. First, all of the entries of D_table are
initialized to false. Then the condensed search tree is descended using D_table
to prune any inconsistent subalignments which may have survived the score
based bound. When a subalignment cannot be pruned, relationships of the form
(’xt’,’xc’) for the newly aligned substring paired with all other substrings
occurring in the newly aligned sequences are added and the transitive closure is
taken.
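
The bookkeeping for D_table can be sketched as a boolean matrix with a transitive closure step. The class below is a simplified illustration only: the class and method names are hypothetical, it records single dominance pairs rather than the whole family (’xt’,’xc’) described above, it uses Warshall's algorithm for the closure (the text does not say which method the authors used), and it does not attempt to reproduce how inconsistency is detected during the search.

```python
from itertools import product


class DominanceTable:
    """A sigma^w x sigma^w boolean matrix of 'a dominates b' relations."""

    def __init__(self, alphabet, w):
        self.names = ["".join(p) for p in product(alphabet, repeat=w)]
        self.index = {s: i for i, s in enumerate(self.names)}
        size = len(self.names)
        self.d = [[False] * size for _ in range(size)]   # all entries start false

    def add(self, winner, loser):
        """Record that substring `winner` dominates substring `loser`."""
        self.d[self.index[winner]][self.index[loser]] = True

    def close(self):
        """Take the transitive closure with Warshall's algorithm."""
        size = len(self.names)
        for k in range(size):
            for i in range(size):
                if self.d[i][k]:
                    for j in range(size):
                        if self.d[k][j]:
                            self.d[i][j] = True

    def dominates(self, a, b):
        return self.d[self.index[a]][self.index[b]]
```

After recording that ’at’ dominates ’ac’ and ’ac’ dominates ’ag’, calling `close()` makes `dominates("at", "ag")` true, which is the kind of derived relationship the closure is meant to expose.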

7 Berkeley BB 

For comparison purposes we report the running times of aligning sequences with 
a previous branch and bound algorithm Q, which we refer to as the Berkeley BB 
algorithm. That algorithm uses a non-condensed search tree, which is dependent 
on the order of the input sequences. The bound used by the algorithm is different 
than the score based pruning criteria described here. The worst case time of the 
algorithm is 0(F), but in practice the algorithm runs much faster than a naive
enumerative algorithm. We slightly modified the implementation to allow priors
to be used by adding “dummy sequences” of length w to the input sequences.
The modified program can add dummy sequences to the beginning or end of the
input, but adding to the beginning was faster for the cases we compared so we
report those times.

    AT gac
    AC atg

Fig. 3. The aligned substrings shown in upper case.



8 Results 

Dataset. We used a dataset of 300 E. coli promoter sequences of length 100.
The {a, c, g, t} content of the dataset was 28.2%, 21.2%, 21.2%, and 29.8%
respectively. Q and give analysis and pointers to the extensive biological 
literature concerning E.coli promoters. 



Exact Methods. This section reports results from a 450 MHz Pentium II
machine running Linux; Tsukuba BB is a C++ program, while Berkeley BB is a
C program. For motif widths of up to four we were able to compute an optimal
alignment of all 300 sequences using Tsukuba BB. Results for a motif width of
four with the Laplace prior are shown in table 1. Results for the Tsukuba BB
algorithm for a motif width of five are shown in table 2. In contrast, with a motif
width of six and the Laplace prior, Berkeley BB requires 19 hours to align 10
sequences. The running times for the two algorithms for motif widths of six and
seven are shown in figure 4. This figure plots times with the Laplace prior.



# seqs   algorithm   uniform    input comp.
7        Berkeley    22.4 min   35.3 min
7        Tsukuba     0.67 sec   1.07 sec
300      Tsukuba     44.2 sec   28.8 min

Table 1. Running times of Tsukuba BB and Berkeley BB are shown for
w=4. The background model was either a uniform distribution or the
composition of the input sequences.



# seqs   priors   no priors
40       4.45     0.96
45       5.50     1.24
50       30.4     4.65
55       50.0     19.5

Table 2. Running times in hours are shown for w=5, with and without
priors. Tsukuba BB was used with a uniform background model.




Fig. 4. The running times of the Tsukuba BB and Berkeley BB algorithms are 
shown for promoter sequences. The x-axis is the number of sequences. The y-axis 
is the running time in hours. (From left to right) the first two plots show times 
with a uniform background and motif widths of 6 and 7 respectively. The last 
shows times for an input composition background model with a motif width of
6.

Heuristics. This section reports the results of running two heuristic algorithms
on the input sequences with a motif width of four, a uniform background, and the
add-one Laplace prior. The optimal alignment has a score of 6.258, and can be
computed in 44 seconds with Tsukuba BB. Two heuristics were used: 1. a beam
search algorithm similar to the one used by [?] but with a beam width of 5000,
and 2. the Gibbs sampling heuristic [?]. The beam search heuristic was coded in
“C” and the Gibbs sampling program in “C++”. The Gibbs sampling program uses
a round robin scheduling pattern for the choice of sequence to realign at each
step. It has three modes: No Shifting, Plain Shifting, and Forced Shifting. No
shifting is the straight Gibbs sampling algorithm. Shifting is the “phase shifting”
described by [?] of up to w − 1 positions to the left or right. Plain shifting and
forced shifting differ only in how they treat boundary conditions. Plain shifting
does not consider shifts in which some position(s) cannot be shifted without
going past the edge of a sequence, while forced shifting considers those shifts by
simply leaving such positions unchanged while shifting the others. Plain shifting
becomes less meaningful as the number of sequences grows. This can be seen by
considering the proportion of alignments which can be shifted one to the left,
which is $(l - w)^n / (l - w + 1)^n$. For l = 100, w = 4, n = 300 this proportion is
only 4.47%. Table 3 shows the results of tests run with the different heuristics. The
beam search is deterministic, but depends on the order of the input sequences,
so we used 10 different randomly generated orderings of the input sequences
for the beam search trials. Gibbs sampling is inherently stochastic so we simply
ran it (for 70,000,000 potential sequence realignments) in each mode ten times
(in each case starting with a randomly chosen alignment). For the two shifting
modes, shifting was considered once every 1000 iterations of the main Gibbs
sampling loop.
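
The proportion quoted above for plain shifting is easy to reproduce. The function name `shiftable_fraction` is illustrative; the formula is the one given in the text, with l the sequence length, w the motif width, and n the number of sequences.

```python
def shiftable_fraction(l, w, n):
    """Fraction of alignments in which every substring can shift one position
    to the left: each of the n start positions must avoid the first of the
    l - w + 1 possible positions."""
    return ((l - w) / (l - w + 1)) ** n


if __name__ == "__main__":
    # For the promoter dataset: l = 100, w = 4, n = 300
    print(f"{shiftable_fraction(100, 4, 300):.2%}")
```

Since the fraction decays geometrically in n, plain shifting is rarely applicable once a few hundred sequences are aligned.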



Heuristic   mean score   best score   mean time (min)
Beam        6.083        6.242        57.3
No          5.000        5.077        49.9
Plain       5.014        5.087        50.1
Forced      5.048        5.112        58.7



Table 3. The average score, best score, and average running times in minutes 
are shown for different heuristics. The averages are over 10 trials with the full 
promoter data set and a motif width of four. “No”, “Plain”, and “Forced” refer 
to the modes of the Gibbs sampling program. 



Tsukuba BB for the ZOOPS Problem. Q proposed a variant of the EM 
problem formulation in which it is assumed that the motif occurs once in some 
sequences, and not at all in the other sequences. In this section we describe a gen- 
eralization of the Tsukuba BB algorithm which uses that general idea and their 
name, ZOOPS, for Zero or One Occurrences Per Sequence. The corresponding 
name for requiring one occurrence in each sequence is OOPS.

The modified problem is to find a subalignment of size k < n, present in the
condensed search tree, that has a maximal MLR score, where k is a user given
parameter. Two minor changes to the Tsukuba algorithm allow this
generalization. First, nodes whose subalignments contain k or more total substrings are
not further expanded, and second, the role of n in the pruning criteria is replaced
by k.



Pruning Criteria (ZOOPS, No Priors): For the ZOOPS problem, aligning k
sequences, prune a node q, representing a subalignment with m total substrings,
if any extension $A^{(q,i)}$, $m < i \le k$, can be found such that:

$$R(A^{(q,i)}) < L$$

where L is defined as a lower bound for the optimal score of aligning k sequences.
When priors are used i must be constrained to be exactly k.

We investigated the use of this generalization on a truncated version of the
promoter dataset, in which only the 55 bases from position -50 to position +5
were used. All of the results given in this section were obtained using these
shorter sequences. We used Tsukuba BB to align 100 of the 300 truncated
promoter sequences with a motif width of six, requiring 54 hours. Figure 5 shows
the substrings chosen in the alignment. Figure 6 shows a histogram of the
starting positions of the substrings found in the alignment. The occurrences cluster
nicely around the -10 position which is consistent with the currently accepted
model of promoter structure.

substring     # occurrences
(illegible)   27
tagaat        18
(illegible)   14
(illegible)   14
(illegible)   12
(illegible)   9
tagaaa        6

Fig. 5. Substrings aligned with the ZOOPS model and a requirement of aligning
100 sequences from the shortened promoter dataset.

[Histogram: x-axis, start position from -50 to 0; y-axis ticks from 0 to 25; w = 6, 100 seqs]

Fig. 6. A histogram of the starting position of the substrings aligned when at
least 100 sequences out of 300 were required to be aligned. The substrings were
of width six.






9 Discussion 

Tsukuba BB is dramatically faster than Berkeley BB for motif widths of up 
to five. For these motif widths pruning based on inconsistency very effectively 
complements the score based pruning. For example, the 29 minutes required to 
align a motif of width four for 300 sequences with an input composition based
background model increases to 61 hours when inconsistency based pruning is
disabled. Unfortunately this pruning technique is much less effective as the motif 
width grows (and the number of sequences to be aligned decreases) . For example, 
the 52 hours required to align a motif of width six for 19 sequences with a uniform 
background only increased to 58 hours when inconsistency based pruning was 
disabled. On the other hand, canonical based pruning is relatively more effective 
for longer motif widths, which is expected since the chance that two substrings 
occur in the same sequence decreases. For example, for a motif width of six and 
a uniform background model the time required to align 14 sequences increased 
by a factor of more than 10 when canonical based pruning was disabled. 

For a motif width of four Tsukuba BB is able to find the optimal alignment 
of 300 sequences in 44 seconds, along with a guarantee of its optimality. The two 
common heuristic algorithms tested here could not find the optimum even once 
after 10 trials of close to one hour each. We do not claim that these are necessarily 
the best heuristics and parameter settings possible, for example a sequence order 
independent version of beam search has been developed recently Q . However we 
believe these results raise the issue of how well heuristic algorithms will scale up 
to larger data sets. We note that both heuristics could often find optimal or near 
optimal alignments in just a few minutes, when the number of sequences was on 
the order of 20 (results not shown) . We do however acknowledge the possibility 
that the heuristics may perform better with longer motif widths. We did not 
evaluate EM because it is a heuristic for a different problem formulation. The 
EM formulation considers the likelihood of the motif model summed over all 
possible alignments, while the formulation adopted in this work considers only 
the likelihood of the most likely alignment (i.e. Viterbi path). We note that the 
significant difference between these two formulations has been widely overlooked 
in the biological literature. 

The results given with ZOOPS were primarily intended to demonstrate how 
Tsukuba BB can be used for the ZOOPS problem formulation. We note that in 
general the algorithm can align many more sequences with the ZOOPS than with 
the basic OOPS problem formulation. Indeed from the results in this paper we 
see that aligning 100 sequences out of 300 with a motif width of six requires about 
the same amount of time as aligning 19 sequences with OOPS. ^3 analyzes a 
dataset in which only approximately one third of the sequences are thought to 
contain the motif (ribosome binding site) of interest; a case in which the ZOOPS 
problem formulation is appropriate. 

Tsukuba BB can currently align most conceivable input datasets for a motif 
width of four and is not too far from being able to do so for a motif width of 
five. Since this approach is new it is likely that some further speed-ups will be 
discovered. Furthermore the algorithm, even when using canonical nodes, is just
depth first search with essentially no global state (although the lower bound L is
updated infrequently when better feasible solutions are found, using a stale value
of L does not affect correctness). Thus the algorithm is easy to parallelize and will
benefit fully from hardware advances. In our experiments it handled more
sequences than the previous exact method, Berkeley BB. However
at present the number of sequences which can be handled for a motif width of
six or greater is still limited, especially for the OOPS problem formulation.



Conclusion. We have presented the first exact algorithm for a matrix based 
scoring function which actively exploits the fact that this generally hard problem 
is easier for a fixed length short motif. In fact the algorithm is asymptotically 
linear in the number of sequences. In practice the algorithm can align more
sequences than the best previous exact method and in some cases can find
guaranteed optimal solutions where reasonable heuristics fail.



Acknowledgments. We thank Dr. Yutaka Akiyama for careful reading of this
manuscript.



Abstract. In this paper we study the following problem: Given n strings
$s_1, s_2, \ldots, s_n$, each of length m, find a substring $t_i$ of length L for each
$s_i$, and a string s of length L, such that $\max_{i=1}^{n} d(s, t_i)$ is minimized,
where $d(\cdot, \cdot)$ is the Hamming distance. The problem was raised in [?] in
an application of genetic drug target search and is a key open problem in
many applications [?]. The authors of [?] showed that it is NP-hard and
can be trivially approximated within ratio 2. A non-trivial approximation
algorithm with ratio better than 2 was found in [?]. A major open
question in this area is whether there exists a polynomial time approximation
scheme (PTAS) for this problem. In this paper, we answer this
question positively. We also apply our method to two related problems.



1 Introduction 

Let s be a string. Without further specification, we suppose it is a string over
alphabet $\Sigma = \{1, 2, \ldots, A\}$. Denote $l(s)$ as the length of s. Let s and s' be two
strings of the same length. Then $d(s, s')$ denotes the Hamming distance between s
and s'. Let s and s' be two strings such that $l(s) \ge l(s')$. s is said to be d-close to
s' if it contains a substring t of length $l(s')$ such that $d(t, s') \le d$. The closest
substring problem is defined as:

Closest Substring Problem. Given a set $S = \{s_1, s_2, \ldots, s_n\}$ of strings each
of length m, and an integer L, find a “center” string s and a substring $t_i$ of
length L for each $s_i$, minimizing d such that for each $1 \le i \le n$, $d(s, t_i) \le d$.
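
The objective in this definition can be evaluated directly, as in the sketch below. The function names are illustrative. The last function shows one standard way to obtain the trivial ratio 2 mentioned in the abstract, namely trying every length-L substring of the first input string as the center; the text does not spell out that algorithm, so treating it as the intended one is an assumption. It achieves ratio 2 because the optimal substring of $s_1$ is within $d_{opt}$ of the optimal center, and so within $2 d_{opt}$ of every other optimal substring by the triangle inequality.

```python
def hamming(a, b):
    """Hamming distance between two strings of the same length."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_window(center, s):
    """Return (distance, substring) for the length-L window of s closest to center."""
    L = len(center)
    return min((hamming(center, s[i:i + L]), s[i:i + L])
               for i in range(len(s) - L + 1))


def radius(center, strings):
    """The objective d: the largest distance from center to the best window of each string."""
    return max(closest_window(center, s)[0] for s in strings)


def trivial_two_approx(strings, L):
    """Try every length-L substring of the first string as the center
    (assumed form of the trivial ratio-2 algorithm)."""
    first = strings[0]
    candidates = {first[i:i + L] for i in range(len(first) - L + 1)}
    return min(candidates, key=lambda c: (radius(c, strings), c))
```

Evaluating `radius` for a fixed center takes time proportional to n · m · L, so the trivial algorithm runs in polynomial time.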

The closest substring problem was introduced in [?] and is a key theoretical
open problem in applications such as antisense drug design [?], creating
diagnostic probes [?], and creating universal PCR primers [?]. In these applications,
one wants to distinguish two sets of strings by one short string, which is close
to a substring of each string in one set and far from any string in another set.
For example, in the applications of drug design, one wants to design a drug that
would kill several closely related pathogenic bacteria while it would be relatively
harmless to humans. In order to do this, one might look for a short strand of
nucleic acid sequence which can bind to part of a vital gene of each bacterium
yet cannot bind to any part of the genes of humans. This naturally raises the
following question:

Distinguishing String Problem. Given a set $S_b \subseteq \Sigma^m$ of (bad) strings, a set
$S_g \subseteq \Sigma^L$ of (good) strings and two thresholds $d_b$ and $d_g$, find a string x such
that $s_b$ is $d_b$-close to x for every $s_b \in S_b$, and $d(x, s_g) \ge d_g$ for every $s_g \in S_g$.

The reason to call $S_b$ "bad" is that it represents the sequences of harmful bacteria. Note that when $S_g$ is empty, the distinguishing string problem is actually the decision version of the closest substring problem. In contrast, when $S_b$ is empty, the problem is the decision version of the farthest string problem studied in Q, which seems easier than the closest substring (string) problem and has a PTAS, as proved in ^ by a standard technique. In practice, $d_b$ is usually small and $d_g$ is usually large. Therefore, finding a string that satisfies the conditions for $S_b$ is more decisive than finding one that satisfies the conditions for $S_g$. We will see this more clearly in Section 4. Therefore, the closest substring problem is more crucial in these applications. However, this problem proved so elusive that an easier version of it had to be studied:

Closest String Problem. Given a set $S = \{s_1, s_2, \ldots, s_n\}$ of strings each of length $m$, find a "center" string $s$ of length $m$ minimizing $d$ such that for each $1 \le i \le n$, $d(s_i, s) \le d$.

Analogous to the closest substring problem, the authors of | also introduced another related problem, the max close string problem, which seeks a string close to as many "bad" strings as possible:

Max Close String Problem. Given a set $S = \{s_1, s_2, \ldots, s_n\}$ of strings of length at least $L$, and a threshold $d > 0$, find a string $s$ of length $L$ maximizing the number of strings $s_i$ in $S$ that are $d$-close to $s$.

The closest string problem has been studied widely and independently in different contexts. In the context of coding theory it was shown to be NP-hard [3]. In DNA sequence related topics, the authors of [1] used a standard linear programming and random rounding technique and gave a near-optimal algorithm only for large $d$ (super-logarithmic in the number of sequences). Then in [6], the authors gave a $\frac{4}{3}$-approximation algorithm. They used the linear programming and random rounding technique only at $O(d_{opt})$ positions of the solution, while using a certain string in $S$ to approximate the solution at the other $L - O(d_{opt})$ positions. [4] also presented a $\frac{4}{3}$-approximation with a similar idea. Later, the authors of [7] used a more sophisticated method to find the $O(d_{opt})$ positions and achieved a PTAS for the problem.

However, the closest substring problem seemed much more elusive. It is certainly NP-hard, as the closest string problem is. It admits a trivial ratio-2 approximation as shown in [6]. In [7], the authors found a nontrivial approximation algorithm, with approximation ratio $2 - \frac{2}{2|\Sigma|+1}$. At $O(d_{opt})$ positions of the solution, they used a random string to approximate the optimal solution for some time. This gains a reduction of $\frac{2}{2|\Sigma|+1}$ from the trivial ratio 2. However, when $|\Sigma|$ is large, e.g. 20 for protein sequences, this improvement is very small.
The methods that solved the closest string problem cannot be extended straightforwardly to the closest substring problem. This is because we do not know how to generate the linear program on the $O(d_{opt})$ positions. Only when the optimal solution satisfies $d_{opt} = O(\log(nm))$ can one find the best solution at the $O(d_{opt})$ positions by trying all possibilities, instead of using the linear programming approach. In this way, a PTAS was obtained for the closest substring problem when $d_{opt} = O(\log(nm))$.

In this paper, combining the methods in [7] and a random sampling technique, we present a PTAS for the closest substring problem. We will also prove that the max close string problem cannot be approximated within ratio $n^{\epsilon}$ for any $\epsilon \in (0, \frac{1}{4})$ unless P = NP. Finally, as two applications of the above PTAS, we give suboptimal solutions to the max close string problem and the distinguishing string problem.



2 A PTAS for the Closest Substring Problem 



Let $t$ be a string of length $m$. Denote the $j$-th letter of $t$ by $t[j]$. Let $R = \{j_1, j_2, \ldots, j_k\}$ be a multiset of positions, where $1 \le j_i \le m$. Then $t|_R$ is defined to be the length-$k$ string $t[j_1]t[j_2] \cdots t[j_k]$.
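The notation above can be illustrated with a minimal Python sketch. The function names `hamming` and `restrict` are illustrative choices, not names used in the paper; positions are 1-based, as in the text.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def restrict(t: str, positions: Sequence[int]) -> str:
    """Return t|_R for a multiset R of 1-based positions, in the given order."""
    return "".join(t[j - 1] for j in positions)


t = "ACGTAC"
R = [2, 2, 5]              # a multiset: position 2 appears twice
print(restrict(t, R))      # CCA
```

Because $R$ is a multiset, a position may contribute more than once to a restricted distance, which is what the random sampling argument later relies on.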

First, let us recall the PTAS for the closest string problem in [7]. Let $S = \{t_1, t_2, \ldots, t_n\}$ be a set of $n$ strings of length $m$. Suppose $s$ is the closest string of $S$, i.e., $s$ is such that $\max_{1 \le i \le n} d(s, t_i)$ is minimized. Denote $d_{opt} = \max_{1 \le i \le n} d(s, t_i)$.

For any given $r \ge 2$, let $1 \le i_1, i_2, \ldots, i_r \le n$ be $r$ distinct numbers. Let $Q_{i_1,i_2,\ldots,i_r}$ be the set of positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree. The following lemma is proved in [7] (Claim 16 and 17):

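Lemma 1. [7] Let $\rho_0 = \max_{1 \le i,j \le n} d(t_i, t_j)/d_{opt}$. For any constant $r$, if $\rho_0 > 1 + \frac{1}{2r-1}$ then there are indices $1 \le i_1, i_2, \ldots, i_r \le n$ such that for any $1 \le l \le n$,

$$d(t_l|_{Q_{i_1,i_2,\ldots,i_r}}, t_{i_1}|_{Q_{i_1,i_2,\ldots,i_r}}) - d(t_l|_{Q_{i_1,i_2,\ldots,i_r}}, s|_{Q_{i_1,i_2,\ldots,i_r}}) \le \frac{1}{2r-1}\, d_{opt}.$$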



This lemma allows the authors of [7] to use $t_{i_1}$ to approximate $s$ at the positions in $Q_{i_1,i_2,\ldots,i_r}$. Let $Q = Q_{i_1,i_2,\ldots,i_r}$ and $P = \{1, 2, \ldots, m\} - Q$. It is easy to verify that $|P| \le r\, d_{opt} = O(d_{opt})$. At the positions in $P$, $s$ can be approximated by solving the following optimization problem:

$$\min d; \quad d(t_i|_P, x) \le d - d(t_i|_Q, t_{i_1}|_Q),\ i = 1, \ldots, n;\ |x| = |P|. \qquad (1)$$

Since it can be proved that the optimal solution of the above problem satisfies $d = O(|P|)$, it can be approximately solved by i) rewriting it as a zero-one optimization problem; ii) solving the relaxed linear program; and iii) random rounding. Finally, combining the approximate solution $x$ at $P$ and $t_{i_1}$ at $Q$ gives a PTAS.
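The split of positions into $Q$ and $P$, and the final step of combining a solution on $P$ with one of the chosen strings on $Q$, can be sketched as follows. This is only an illustration under the definitions above; the names `agreement_split` and `assemble` are hypothetical, and the linear programming step that produces $x$ is not shown.

```python
from typing import List, Sequence, Tuple


def agreement_split(strings: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Split 1-based positions into Q (all strings agree) and P (the rest)."""
    length = len(strings[0])
    if any(len(t) != length for t in strings):
        raise ValueError("all strings must have the same length")
    Q, P = [], []
    for j in range(length):
        column = {t[j] for t in strings}
        (Q if len(column) == 1 else P).append(j + 1)
    return Q, P


def assemble(base: str, P: Sequence[int], x: str) -> str:
    """Keep `base` outside P and write the letters of x at the positions of P."""
    out = list(base)
    for pos, ch in zip(P, x):
        out[pos - 1] = ch
    return "".join(out)


Q, P = agreement_split(["ACGTA", "ACCTA", "ATGTA"])
print(Q, P)                          # [1, 4, 5] [2, 3]
print(assemble("ACGTA", P, "TT"))    # ATTTA
```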

The algorithm for the closest string problem cannot be extended to the closest substring problem straightforwardly, since we do not know which substring $t_i$ ($i = 1, 2, \ldots, n$) to use in the optimization problem above. It is easy to see that the choice of good $t_i$'s is the only obstacle on the way to the solution. Our strategy is random sampling.

Now let us outline the main ideas. Let $(S = \{s_1, s_2, \ldots, s_n\}, L)$ be an instance of Closest Substring, where each $s_i$ is of length $m$. Suppose that $s$ is its optimal center string and $t_i$ is a length-$L$ substring of $s_i$ which is the closest to $s$ ($i = 1, 2, \ldots, n$). Let $d_{opt} = \max_{1 \le i \le n} d(s, t_i)$. By trying all possibilities, we can assume that $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ are the $r$ substrings $t_{i_j}$ that satisfy Lemma 1. Let $Q$ be the set of positions where $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ agree and $P = \{1, 2, \ldots, L\} - Q$. By Lemma 1, $t_{i_1}|_Q$ is a good approximation to $s|_Q$. We want to approximate $s|_P$ by the solution $x$ of the following optimization problem (2), where $t'_i$ is a substring of $s_i$ and is up to us to choose.



$$\min d; \quad d(t'_i|_P, x) \le d - d(t'_i|_Q, t_{i_1}|_Q),\ i = 1, \ldots, n;\ |x| = |P|. \qquad (2)$$



The ideal choice is $t'_i = t_i$, i.e., $t'_i$ is the closest to $s$ among all substrings of $s_i$. However, we only approximately know $s$ in $Q$ and know nothing about $s$ in $P$ so far. So, we randomly pick $O(\log(mn))$ positions from $P$. Suppose the multiset of these random positions is $R$. Since $|R| = O(\log(mn))$, by trying all possibilities, we can assume that $s|_R$ is known. We then find the substring $t'_i$ from $s_i$ such that $d(s|_R, t'_i|_R) \times \frac{|P|}{|R|} + d(t_{i_1}|_Q, t'_i|_Q)$ is minimized. Then $t'_i$ potentially belongs to the substrings of $s_i$ that are the closest to $s$.

Then we solve (2) approximately by the method provided in [7] and combine the solution $x$ at $P$ and $t_{i_1}$ at $Q$; the resulting string should be a good approximation to $s$.
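The substring selection rule just described can be sketched in Python. This is a hedged illustration only: `pick_substring` and its parameter names are assumptions made for the example, $R$ is assumed non-empty, and in the actual algorithm the guess $y$ for $s|_R$ is obtained by enumerating all strings of length $|R|$ rather than being known.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def restrict(t: str, positions: Sequence[int]) -> str:
    return "".join(t[j - 1] for j in positions)


def pick_substring(s_i: str, L: int, y: str, R: Sequence[int],
                   P: Sequence[int], Q: Sequence[int], t_i1: str) -> str:
    """Length-L substring t of s_i minimising the estimated distance
    d(y, t|_R) * |P|/|R| + d(t_i1|_Q, t|_Q)."""
    rho = len(P) / len(R)
    best, best_score = None, float("inf")
    for start in range(len(s_i) - L + 1):
        t = s_i[start:start + L]
        score = (rho * hamming(y, restrict(t, R))
                 + hamming(restrict(t_i1, Q), restrict(t, Q)))
        if score < best_score:
            best, best_score = t, score
    return best


# Toy instance: Q = {1, 4}, P = {2, 3}, sampled multiset R = [2, 3, 2].
print(pick_substring("GGACGTGG", 4, "CGC", [2, 3, 2], [2, 3], [1, 4], "ACGT"))
```

The factor $|P|/|R|$ rescales the mismatches counted on the sample so that they estimate the mismatches on all of $P$.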

The more detailed algorithm (Algorithm closestSubstring) is given in Figure 1. We prove Theorem 1 in the rest of the section. Before the proof, we need a lemma which is commonly known as Chernoff's bounds ([10], Theorem 4.2 and 4.3):

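Lemma 2. Let $X_1, X_2, \ldots, X_n$ be $n$ independent random 0-1 variables, where $X_i$ takes 1 with probability $p_i$, $0 < p_i < 1$. Let $X = \sum_{i=1}^{n} X_i$, and $\mu = E[X]$. Then for any $\delta > 0$,

(1) $\Pr(X > (1+\delta)\mu) < \left[\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right]^{\mu}$,

(2) $\Pr(X < (1-\delta)\mu) \le \exp\left(-\frac{1}{2}\mu\delta^2\right)$.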

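From Lemma 2 we can prove the following lemma:

Lemma 3. Let $X_i$, $X$ and $\mu$ be defined as in Lemma 2. Then for any $0 < \epsilon \le 1$,

(1) $\Pr(X > \mu + \epsilon n) < \exp\left(-\frac{1}{3} n \epsilon^2\right)$,

(2) $\Pr(X < \mu - \epsilon n) \le \exp\left(-\frac{1}{2} n \epsilon^2\right)$.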

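Proof. (1) Let $\delta = \frac{\epsilon n}{\mu}$. By Lemma 2,

$$\Pr(X > \mu + \epsilon n) < \left[\frac{e^{\frac{\epsilon n}{\mu}}}{\left(1 + \frac{\epsilon n}{\mu}\right)^{\left(1 + \frac{\epsilon n}{\mu}\right)}}\right]^{\mu} = \left[\frac{e}{\left(1 + \frac{\epsilon n}{\mu}\right)^{\left(1 + \frac{\mu}{\epsilon n}\right)}}\right]^{\epsilon n} \le \left[\frac{e}{(1+\epsilon)^{1+\frac{1}{\epsilon}}}\right]^{\epsilon n},$$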
Algorithm closestSubstring

Input $n$ sequences $\{s_1, s_2, \ldots, s_n\} \subseteq \Sigma^m$, integer $L$.

Output the center string $s$.

1. for every $r$ length-$L$ substrings $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ (allowing repeats, but if $t_{i_j}$ and $t_{i_k}$ are both chosen from the same $s_i$ then $t_{i_j} = t_{i_k}$) of $s_1, \ldots, s_n$ do

   (a) $Q = \{1 \le j \le L \mid t_{i_1}[j] = t_{i_2}[j] = \ldots = t_{i_r}[j]\}$, $P = \{1, 2, \ldots, L\} - Q$.

   (b) Let $R$ be a multiset containing $\lceil \frac{4}{\epsilon^2} \log(nm) \rceil$ uniformly random positions from $P$.

   (c) for every string $y$ of length $|R|$ do

       (i) for $i$ from 1 to $n$ do
           Let $t'_i$ be a length-$L$ substring of $s_i$ minimizing $d(y, t'_i|_R) \times \frac{|P|}{|R|} + d(t_{i_1}|_Q, t'_i|_Q)$.

       (ii) Using the method in [7], solve the optimization problem defined by Formula (2) approximately. Let $x$ be the approximate solution within error $\epsilon|P|$.

       (iii) Let $s'$ be the string such that $s'|_P = x$ and $s'|_Q = t_{i_1}|_Q$. Let $c = \max_{i=1}^{n} \min_{\{t_i \text{ is a substring of } s_i\}} d(s', t_i)$.

2. for every length-$L$ substring $s'$ of $s_1$ do
   Let $c = \max_{i=1}^{n} \min_{\{t_i \text{ is a substring of } s_i\}} d(s', t_i)$.

3. Output the $s'$ with minimum $c$ in step 1(c)(iii) and step 2.

Fig. 1. The PTAS for the closest substring problem.
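As a small illustration of Figure 1, the sketch below implements only the cost $c$ of a candidate center and step 2 (trying every length-$L$ substring of the first sequence). It does not implement step 1, which needs the linear programming routine of the cited work; the function names are hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def substrings(s: str, L: int) -> Iterator[str]:
    """All length-L substrings of s, left to right."""
    return (s[k:k + L] for k in range(len(s) - L + 1))


def cost(center: str, seqs: Sequence[str]) -> int:
    """c = max over sequences of the distance to their best length-L substring."""
    L = len(center)
    return max(min(hamming(center, t) for t in substrings(s, L)) for s in seqs)


def best_substring_of_first(seqs: Sequence[str], L: int) -> Tuple[str, int]:
    """Step 2: evaluate every length-L substring of the first sequence."""
    return min(((c, cost(c, seqs)) for c in substrings(seqs[0], L)),
               key=lambda pair: pair[1])


seqs = ["TTACGTTT", "GACGAGGG", "CCCTCGTC"]
print(best_substring_of_first(seqs, 4))
```

Evaluating one candidate costs $O(nmL)$ letter comparisons, so step 2 alone is polynomial; the expensive part of the scheme is the enumeration in step 1.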



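where the last inequality holds because $\mu \le n$ and $(1+x)^{(1+\frac{1}{x})}$ is increasing for $x > 0$. It is easy to verify that for $0 < \epsilon \le 1$, $\frac{e}{(1+\epsilon)^{1+\frac{1}{\epsilon}}} \le \exp\left(-\frac{\epsilon}{3}\right)$. Therefore, (1) is proved.

(2) Let $\delta = \frac{\epsilon n}{\mu}$. By Lemma 2, $\Pr(X < \mu - \epsilon n) \le \exp\left(-\frac{1}{2}\mu\delta^2\right) = \exp\left(-\frac{\epsilon^2 n^2}{2\mu}\right) \le \exp\left(-\frac{1}{2} n \epsilon^2\right)$. (2) is proved. □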
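The two tail bounds of Lemma 3 can be checked numerically in the special case where all $p_i$ are equal, so that $X$ is binomial and its tails can be summed exactly. The following sketch is only a sanity check on a few parameter choices, not a proof; the helper names are illustrative.

```python
from math import ceil, comb, exp, floor


def binom_pmf(n: int, p: float, k: int) -> float:
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(n: int, p: float, eps: float) -> float:
    """Exact Pr(X > mu + eps*n), where mu = n*p."""
    threshold = n * p + eps * n
    return sum(binom_pmf(n, p, k) for k in range(floor(threshold) + 1, n + 1))


def lower_tail(n: int, p: float, eps: float) -> float:
    """Exact Pr(X < mu - eps*n), where mu = n*p."""
    threshold = n * p - eps * n
    return sum(binom_pmf(n, p, k) for k in range(max(ceil(threshold), 0)))


for n in (20, 60):
    for p in (0.1, 0.3, 0.5):
        for eps in (0.05, 0.1, 0.3):
            assert upper_tail(n, p, eps) <= exp(-n * eps**2 / 3)
            assert lower_tail(n, p, eps) <= exp(-n * eps**2 / 2)
print("Lemma 3 bounds hold on all tested parameters")
```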



Theorem 1. Algorithm closestSubstring is a PTAS for the closest substring 
problem. 

Proof. Let $s$ be an optimal center string and $t_i$ be the length-$L$ substring of $s_i$ that is the closest to $s$. Let $d_{opt} = \max_{1 \le i \le n} d(s, t_i)$. Let $\epsilon$ be any small positive number and $r \ge 2$ be any fixed integer. Let $\rho_0 = \max_{1 \le i,j \le n} d(t_i, t_j)/d_{opt}$. If $\rho_0 \le 1 + \frac{1}{2r-1}$ then clearly we can find a solution $s'$ within ratio $\rho_0$ in step 2. So, we suppose $\rho_0 > 1 + \frac{1}{2r-1}$ from now on.

For convenience, for any position multiset $T$ and two strings $t_1$ and $t_2$, we denote $d^T(t_1, t_2) = d(t_1|_T, t_2|_T)$. By Lemma 1, Algorithm closestSubstring picks a group of $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ in step 1 at a certain time such that

Fact 1. For any $1 \le l \le n$, $d^Q(t_{i_1}, t_l) - d^Q(s, t_l) \le \frac{1}{2r-1}\, d_{opt}$.

Obviously, the algorithm takes $y$ as $s|_R$ at a certain time in step 1(c). Let $y = s|_R$ and $t_{i_1}, t_{i_2}, \ldots, t_{i_r}$ satisfy Fact 1. Let $t'_i$ be defined as in step 1(c)(i). Let $s^*$ be a string such that $s^*|_P = s|_P$ and $s^*|_Q = t_{i_1}|_Q$. Then we claim:

Fact 2. With high probability, $d(s^*, t'_i) \le d(s^*, t_i) + 2\epsilon|P|$ for all $1 \le i \le n$.
Proof. Let $\rho = \frac{|P|}{|R|}$. For any length-$L$ substring $t'$ of $s_i$ such that

$$d(s^*, t') > d(s^*, t_i) + 2\epsilon|P|, \qquad (3)$$

from the definition of $s^*$, the following inequality is obvious:

$$\Pr\left(\rho\, d(y, t'|_R) + d^Q(t_{i_1}, t') \le \rho\, d(y, t_i|_R) + d^Q(t_{i_1}, t_i)\right)$$
$$\le \Pr\left(\rho\, d^R(s^*, t') + d^Q(s^*, t') \le d(s^*, t') - \epsilon|P|\right) + \Pr\left(\rho\, d^R(s^*, t_i) + d^Q(s^*, t_i) \ge d(s^*, t_i) + \epsilon|P|\right). \qquad (4)$$

It is easy to see that $d^R(s^*, t')$ is a sum of $|R|$ independent random 0-1 variables $\sum_{i=1}^{|R|} X_i$, where $X_i$ indicates whether $s^*$ mismatches $t'$ at the $i$-th position in $R$. Let $\mu = E[d^R(s^*, t')]$. Then obviously, $\mu = d^P(s^*, t')/\rho$. Therefore, by Lemma 3,

$$\Pr\left(\rho\, d^R(s^*, t') + d^Q(s^*, t') \le d(s^*, t') - \epsilon|P|\right)$$
$$= \Pr\left(d^R(s^*, t') \le (d(s^*, t') - d^Q(s^*, t'))/\rho - \epsilon|R|\right)$$
$$= \Pr\left(d^R(s^*, t') \le \mu - \epsilon|R|\right) \le \exp\left(-\tfrac{1}{2}\epsilon^2|R|\right) \le (nm)^{-\frac{4}{3}}, \qquad (5)$$

where the last inequality is because of $|R| = \lceil \frac{4}{\epsilon^2} \log(nm) \rceil$ by step 1(b) of the algorithm. For the same reason, we have

$$\Pr\left(\rho\, d^R(s^*, t_i) + d^Q(s^*, t_i) \ge d(s^*, t_i) + \epsilon|P|\right) \le (nm)^{-\frac{4}{3}}. \qquad (6)$$

Combining Formula (4), (5) and (6), we know that for any $t'$ that satisfies Formula (3),

$$\Pr\left(\rho\, d(y, t'|_R) + d^Q(t_{i_1}, t') \le \rho\, d(y, t_i|_R) + d^Q(t_{i_1}, t_i)\right) \le 2(nm)^{-\frac{4}{3}}. \qquad (7)$$

For any fixed $1 \le i \le n$, there are fewer than $m$ substrings $t'$ that satisfy Formula (3). Thus, from Formula (7) and the definition of $t'_i$,

$$\Pr\left(d(s^*, t'_i) > d(s^*, t_i) + 2\epsilon|P|\right) \le 2 n^{-\frac{4}{3}} m^{-\frac{1}{3}}. \qquad (8)$$

Summing up over all $i \in [1, n]$, we know that with probability at least $1 - 2(nm)^{-\frac{1}{3}}$, $d(s^*, t'_i) \le d(s^*, t_i) + 2\epsilon|P|$ for all $i$. □

From Fact 1, $d(s^*, t_i) = d^P(s, t_i) + d^Q(t_{i_1}, t_i) \le d(s, t_i) + \frac{1}{2r-1}\, d_{opt}$. Combining with Fact 2 and $|P| \le r\, d_{opt}$, we get

$$d(s^*, t'_i) \le \left(1 + \frac{1}{2r-1} + 2\epsilon r\right) d_{opt}. \qquad (9)$$

By the definition of $s^*$, the optimization problem defined by Formula (2) has a solution $s|_P$ such that $d \le (1 + \frac{1}{2r-1} + 2\epsilon r)\, d_{opt}$. Solving it within error $\epsilon|P|$ by the method in [7], suppose $x$ is the solution. Then by Formula (2), for any $1 \le i \le n$,

$$d(t'_i|_P, x) \le \left(1 + \frac{1}{2r-1} + 2\epsilon r\right) d_{opt} - d(t'_i|_Q, t_{i_1}|_Q) + \epsilon|P|. \qquad (10)$$

Let $s'$ be defined as in step 1(c)(iii); then by Formula (10),

$$d(s', t'_i) = d(x, t'_i|_P) + d(t_{i_1}|_Q, t'_i|_Q) \le \left(1 + \frac{1}{2r-1} + 2\epsilon r\right) d_{opt} + \epsilon|P| \le \left(1 + \frac{1}{2r-1} + 3\epsilon r\right) d_{opt}.$$

It is easy to see that the algorithm runs in polynomial time for any fixed positive $r$ and $\epsilon$. For any $\delta > 0$, by properly setting $r$ and $\epsilon$ such that $\frac{1}{2r-1} + 3\epsilon r \le \delta$, with high probability, the algorithm outputs in polynomial time a solution $s'$ such that $s_i$ is $(1+\delta)d_{opt}$-close to $s'$ for every $1 \le i \le n$. The algorithm can be derandomized by standard methods such as the method of conditional probabilities. Thus, Theorem 1 is correct. □

3 Approximating the Max Close String Problem 

If we could solve the max close string problem $(S, L, d)$ (in polynomial time), then we might try all $d = 0, 1, 2, \ldots, L$ and find the least $d$ such that all the strings in $S$ are $d$-close to a string $s$. It is easy to verify that this $d$ is the optimal value of the closest substring problem $(S, L)$. Since the closest substring problem is NP-hard, we know that the max close string problem is NP-hard too. Actually, we can show a much stronger hardness result:

Theorem 2. For any $0 < \epsilon < \frac{1}{4}$, the max close string problem cannot be approximated within ratio $n^{\epsilon}$ in polynomial time, unless P = NP.

Proof. First, let us prove the theorem for the case $\Sigma = \{0, 1\}$. We reduce the far from most string problem to it.

Far from Most String Problem. Given a set $S$ of strings of length $m$ and a threshold $d > 0$, find a string $x$ of length $m$ maximizing the number of strings $s$ in $S$ that satisfy $d(x, s) \ge d$.

In [6], it is proved that for any finite size alphabet $\Sigma$, unless P = NP, the far from most string problem does not admit an approximation algorithm within ratio $n^{\epsilon}$ for any $0 < \epsilon < \frac{1}{4}$. Suppose $I = \langle S, d \rangle$ is one of its instances such that $\Sigma = \{0, 1\}$. Let $\bar{S} = \{\bar{s} \mid s \in S\}$, where $\bar{s}$ is the string obtained by flipping every bit of $s$. We construct an instance of the max close string problem as $I' = \langle \bar{S}, m, m - d \rangle$.

It is easy to see that the following statements are equivalent:

1. $I$ has a solution $s$ which is far from $k$ strings in $S$ with distance at least $d$.

2. $s$ is close to $k$ strings in $\bar{S}$ with distance at most $m - d$.

3. $I'$ has a solution $s$ which is $(m - d)$-close to $k$ strings in $\bar{S}$.

Therefore, $I$ is equivalent to $I'$. Because of the hardness of approximating the far from most string problem, the theorem is correct for $\Sigma = \{0, 1\}$.
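The equivalence used in the reduction rests on the identity $d(x, \bar{s}) = m - d(x, s)$ for binary strings of length $m$. The sketch below checks, by brute force over all $x$ on a toy instance, that the number of strings far from $x$ in $S$ equals the number of strings close to $x$ in $\bar{S}$. The instance and the function names are made up for illustration.

```python
from itertools import product
from typing import Sequence


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def flip(s: str) -> str:
    """Complement every bit of a binary string."""
    return "".join("1" if c == "0" else "0" for c in s)


def far_count(x: str, S: Sequence[str], d: int) -> int:
    """Number of strings s in S with d(x, s) >= d."""
    return sum(hamming(x, s) >= d for s in S)


def close_count(x: str, S: Sequence[str], d: int) -> int:
    """Number of strings s in S with d(x, s) <= d."""
    return sum(hamming(x, s) <= d for s in S)


S = ["0010", "1110", "0101"]
m, d = 4, 3
S_bar = [flip(s) for s in S]
for bits in product("01", repeat=m):
    x = "".join(bits)
    assert far_count(x, S, d) == close_count(x, S_bar, m - d)
print("far/close counts agree for every x")
```

Since the objective values coincide for every candidate $x$, any approximation ratio achieved on $I'$ carries over unchanged to $I$.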

Now let $\{0, 1\} \subset \Sigma$. Suppose we have an approximation algorithm with ratio $\rho > 1$ for the max close string problem over $\Sigma$. Let $I$ be an instance of the max close string problem over $\{0, 1\}$. We solve it with the ratio-$\rho$ approximation algorithm over $\Sigma$. Suppose $s$ is the solution. We modify $s$ by replacing every character not in $\{0, 1\}$ by 0 or 1 arbitrarily. Obviously, this will not make the solution worse. However, $s$ is a solution over $\{0, 1\}$ now. It is easy to verify that this solution has ratio at most $\rho$. Thus, as the theorem is correct for the alphabet $\{0, 1\}$, it is correct for any alphabet $\Sigma$. □

Although the max close string problem is hard to approximate according to Theorem 2, the following theorem shows that we can approximate it in another sense.

Theorem 3. Given an instance of the max close string problem $(S, L, d)$, suppose an optimal solution $s$ is such that there are $k$ strings in $S$ which contain length-$L$ substrings within Hamming distance $d$ of $s$. For any $\epsilon' > 0$, we can find in polynomial time an $s'$ such that there are at least $k$ strings in $S$ which are $(1 + \epsilon')d$-close to $s'$.

Proof. The algorithm is quite similar to Algorithm closestSubstring, except that in step 1(c)(i) we only keep the $t'_i$'s for those $i$'s satisfying $d(y, t'_i|_R) \times \frac{|P|}{|R|} + d(t_{i_1}|_Q, t'_i|_Q) \le (1 + \frac{1}{2r-1} + 2\epsilon r)d$. Then by the same arguments as in the proof of Theorem 1, we know that with high probability, we have at least $k$ length-$L$ substrings $t'_i$ from at least $k$ distinct strings in $S$ and a length-$L$ string $s'$ such that $d(s', t'_i) \le (1 + \frac{1}{2r-1} + 3\epsilon r)d$. Set $r$ and $\epsilon$ so that $\epsilon' = \frac{1}{2r-1} + 3\epsilon r$. The theorem is proved. □

4 Approximating the Distinguishing String Problem 

In this section, we give a suboptimal solution to the distinguishing string prob- 
lem, using the algorithm for the closest substring problem. The idea is as follows: 
Since db is significantly less than dg, whenever we obtain a good solution at the 
“bad” string side, by Triangular Inequality, it is not too bad at the “good” 
string side. In this sense, the closest substring problem is more crucial to the 
distinguishing string problem than the farthest string problem is. 

Theorem 4. For any $\epsilon' > 0$, there is an algorithm such that if an instance of the distinguishing string problem $I = \langle S_b, S_g, d_b, d_g, L \rangle$ has a solution $s$, then the algorithm outputs a string $s'$ in polynomial time such that $s'$ is a solution of $I' = \langle S_b, S_g, (1 + \epsilon')d_b, d_g - (2 + \epsilon')d_b, L \rangle$.

Proof. We invoke Algorithm closestSubstring to solve the closest substring problem $(S_b, L)$. However, at step 1(c)(i), whenever $i = i_j$, we let $t'_i = t_{i_j}$; at step 3, we output the $s'$ satisfying $c \le (1 + \epsilon')d_b$ and $d(s', s_g) \ge d_g - (2 + \epsilon')d_b$ for any $s_g \in S_g$. We need to show that this $s'$ exists in step 1 or step 2.

Let $s$ be a solution of $I$ and $t_i$ be a substring of $s_i \in S_b$ such that $d(s, t_i) \le d_b$. Following the proof of Theorem 1, in the modified algorithm closestSubstring, there are $t'_i$ ($i = 1, 2, \ldots, n$) such that $t'_{i_j} = t_{i_j}$ and the $s'$ defined in step 1(c)(iii) satisfies

$$d(s', t'_i) \le (1 + \epsilon')d_b. \qquad (11)$$

Particularly,

$$d(s', t_{i_j}) = d(s', t'_{i_j}) \le (1 + \epsilon')d_b. \qquad (12)$$

Since $d(s, t_{i_j}) \le d_b$ and $d(s, s_g) \ge d_g$ for any $s_g \in S_g$, by the triangle inequality, we have

$$d(s_g, t_{i_j}) \ge d_g - d_b. \qquad (13)$$

Further combining with Formula (12) and the triangle inequality, we know that $d(s_g, s') \ge (d_g - d_b) - (1 + \epsilon')d_b = d_g - (2 + \epsilon')d_b$. Combining with Formula (11), the theorem is proved. □

Remark. By combining the conditions for $S_g$ into the optimization problem (2), we can in fact get an algorithm whose output $s'$ is a solution of $I' = \langle S_b, S_g, (1 + \epsilon')d_b, d_g - (1 + \epsilon')d_b, L \rangle$ in Theorem 4. Since the proof is tedious and has no new idea, we omit it.



Acknowledgment 

The author thanks Dr. M. Li and L. Wang for their valuable discussions on the 
topic. 



References 

1. A. Ben-Dor, G. Lancia, J. Perone, and R. Ravi, Banishing Bias from Consensus 
Sequences, Combinatorial Pattern Matching, 8th Annual Symposium, Springer- 
Verlag, Berlin, 1997. 

2. S. Crooke and B. Lebleu (editors). Antisense Research and Applications, CRC 
Press, 1993. 

3. M. Frances and A. Litman, On covering problems of codes. Theory of Computing 
Systems, vol. 30, pp. 113-119, 1997. 

4. L. Gąsieniec, J. Jansson, and A. Lingas, Efficient approximation algorithms for the
Hamming center problem. Proceedings of the Tenth Annual ACM-SIAM Sympo- 
sium on Discrete Algorithms, pp. 905-906, San Francisco, 1999. 

5. D. M. Hillis, C. Moritz, and B. K. Mable, Molecular Systematics, 2nd ed., Sinauer
Associates Inc., Sunderland, 1996. 

6. K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, Distinguish string search prob- 
lems, Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algo- 
rithms, pp. 633-642, San Francisco, 1999. 

7. M. Li, B. Ma, and L. Wang, Finding similar regions in many strings. Proceedings 
of the Thirty-first Annual ACM Symposium on Theory of Computing, pp. 473-482, 
Atlanta, 1999. 

8. B. Ma, Some approximation algorithms on strings and trees, Ph.D. Thesis, Peking 
University, 1999. 

9. A. Macario and E. Macario, Gene Probes for Bacteria, Academic Press, 1990.

10. R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University 
Press, 1995. 

11. L. Wang, Personal Communication, 1999. 



Approximation Algorithms for 
Hamming Clustering Problems 



Leszek Gąsieniec¹, Jesper Jansson², and Andrzej Lingas²

¹ Dept. of Computer Science, University of Liverpool
Peach Street, L69 7ZF, UK
leszek@csc.liv.ac.uk

² Dept. of Computer Science, Lund University, Box 118, 221 00 Lund, Sweden
{Jesper.Jansson, Andrzej.Lingas}@cs.lth.se



Abstract. We study Hamming versions of two classical clustering problems. The Hamming radius p-clustering problem (HRC) for a set $S$ of $k$ binary strings, each of length $n$, is to find $p$ binary strings of length $n$ that minimize the maximum Hamming distance between a string in $S$ and the closest of the $p$ strings; this minimum value is termed the p-radius of $S$ and is denoted by $\varrho$. The related Hamming diameter p-clustering problem (HDC) is to split $S$ into $p$ groups so that the maximum of the Hamming group diameters is minimized; this latter value is called the p-diameter of $S$.
First, we provide an integer programming formulation of HRC which yields exact solutions in polynomial time whenever $k$ and $p$ are constant. We also observe that HDC admits straightforward polynomial-time solutions when $k = O(\log n)$ or $p = 2$. Next, by reduction from the corresponding geometric p-clustering problems in the plane under the $L_1$ metric, we show that neither HRC nor HDC can be approximated within any constant factor smaller than two unless P=NP. We also prove that for any $\epsilon > 0$ it is NP-hard to split $S$ into at most $pk^{1/7-\epsilon}$ clusters whose Hamming diameter doesn't exceed the p-diameter. Furthermore, we note that by adapting Gonzalez' farthest-point clustering algorithm, HRC and HDC can be approximated within a factor of two in time $O(pkn)$. Next, we describe a $(1 + \epsilon)$-approximation algorithm for HRC. In particular, it runs in polynomial time when $p = O(1)$ and $\varrho = O(\log(k + n))$. Finally, we show how to find in $O((\frac{p}{\epsilon} + kn \log n + k^2 \log n)(2^{\varrho}k)^{2/\epsilon})$ time a set $L$ of $O(p \log k)$ strings of length $n$ such that for each string in $S$ there is at least one string in $L$ within distance $(1 + \epsilon)\varrho$, for any constant $0 < \epsilon < 1$.



1 Introduction 

Let $Z_2^n$ be the set of all strings of length $n$ over the alphabet $\{0, 1\}$. For any $\alpha \in Z_2^n$, we use the notation $\alpha[i]$ to refer to the symbol placed at the $i$th position of $\alpha$, where $i \in \{1, .., n\}$. The Hamming distance between $\alpha_1, \alpha_2 \in Z_2^n$ is defined as the number of positions in which the strings differ, and is denoted by $d(\alpha_1, \alpha_2)$.

R. Giancarlo and D. Sankoff (Eds.): CPM 2000, LNCS 1848, pp. 108-^^^ 2000. 
The Hamming radius p-clustering problem¹ (HRC) is stated as follows: Given a set $S$ of $k$ binary strings $\alpha_i \in Z_2^n$, where $i = 1, .., k$, and a positive integer $p$, find $p$ strings $\beta_j \in Z_2^n$, where $j = 1, .., p$, minimizing the value $\varrho = \max_{1 \le i \le k} \min_{1 \le j \le p} d(\alpha_i, \beta_j)$. Such a set of $\beta_j$'s is called a p-center set of $S$, and the corresponding value of $\varrho$ is called the p-radius of $S$. Note that an instance of HRC can have several p-center sets.

The Hamming diameter p-clustering problem (HDC) is defined on the same set of instances as HRC, and is stated as follows: Partition $S$ into $p$ disjoint subsets $S_1, .., S_p$ (called p-clusters of $S$) so that the value of $\max_{1 \le q \le p} \max_{\alpha_i, \alpha_j \in S_q} d(\alpha_i, \alpha_j)$ is minimized. This value is called the p-diameter of $S$.
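The two objective functions can be written down directly. The sketch below evaluates the radius achieved by a given set of centers and the diameter of a given partition; it does not search for optimal solutions. The names `radius` and `diameter` are illustrative.

```python
from itertools import combinations
from typing import Sequence


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def radius(S: Sequence[str], centers: Sequence[str]) -> int:
    """max over strings in S of the distance to the nearest center."""
    return max(min(hamming(a, b) for b in centers) for a in S)


def diameter(clusters: Sequence[Sequence[str]]) -> int:
    """Largest pairwise distance inside any cluster (0 if all are singletons)."""
    return max((hamming(a, b)
                for cluster in clusters
                for a, b in combinations(cluster, 2)), default=0)


S = ["0000", "0011", "1111", "1100"]
print(radius(S, ["0001", "1110"]))                       # 1
print(diameter([["0000", "0011"], ["1111", "1100"]]))    # 2
```

The example also shows the structural difference between the two problems: the centers `0001` and `1110` are not members of $S$, whereas the diameter depends only on the input strings.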

One can immediately generalize HRC and HDC by considering a larger finite 
size alphabet instead of {0, 1}, making the problem more amenable to biological 
applications. However, as long as the distance between two different characters 
is measured as one, such a generalization involves only trivial generalizations 
of our approximation methods. Therefore, we only consider the original binary 
versions of HRC and HDC throughout this paper. 

In Q, Frances and Litman showed that the decision version of the Ham- 
ming radius 1-clustering problem (1-HRC) is NP-complete. Motivated by the 
intractability of 1-HRC and its applications in computational biology, coding 
theory, and data compression, two groups of authors recently provided several 
close approximation algorithms QQ. This was followed by a polynomial-time 
approximation scheme (PTAS) for 1-HRC Q. As for the more general HRC 
and HDC, one can merely find work on the related graph or geometric p-center, p-supplier, and p-clustering problems in the literature. In the undirected complete graph case, with edge weights satisfying the triangle inequality, all three of the aforementioned problems are known to admit 2-approximation or 3-approximation polynomial-time algorithms, but none of them is approximable within $2 - \epsilon$ for any $\epsilon > 0$ in polynomial time unless P=NP. This contrasts with the $p = O(1)$ case when, e.g., the graph p-center and p-supplier problems can be trivially and exactly solved in $n^{O(p)}$ time. HRC doesn't seem easier than these graph problems. Optimal or nearly optimal center solutions to it have to be searched for in $Z_2^n$, whose size might be exponential in the input size. For this reason, HRC is NP-complete already for $p = 1$. Our results indicate that in the general case HRC as well as HDC are as hard to approximate in polynomial time as the p-center or p-clustering graph problems are.



1.1 Motivation 

Clustering is used to solve classification problems in which the elements of a specified set have to be divided into classes so that all members of a class are similar to each other in some sense. HRC and HDC are as fundamental within string algorithms as the corresponding graph and geometric center and clustering problems are within graph algorithms or computational geometry, respectively. They have potential applications in computational biology and pattern matching.

¹ The corresponding graph problem is often termed the p-center problem in the literature Q.

For example, when classifying biomolecular sequences, consensus representatives are useful. The roughly 100000 different proteins in humans can be divided into 1000 (or fewer) protein families, which makes it easier for researchers to understand their structures and biological functions Q. A lot of information about a newly discovered protein may be deduced by establishing which family it belongs to. During identification, it is more efficient to try to align the new protein to representatives for various families than to individual family members. Conversely, given a set $S$ of $k$ related sequences, one way to find other similar sequences is by computing $p$ representatives (where $p \ll k$) for $S$ and then using the representatives to probe a genome database. The representatives should resemble all sequences in $S$, and must be chosen carefully. For instance, when $p = 1$, the sequence $s$ that minimizes the sum of all pairwise distances between $s$ and elements in $S$ is biased towards sequences that occur frequently, but using a 1-center as representative will avoid this problem.² For $p > 1$, the representatives can be the members in the p-center set or simply $p$ sequences, each from a different p-cluster.

In pattern matching applications, the number of classes p can be large; a 
system for Chinese character recognition, for example, would need to be able to 
discriminate between thousands of characters. 

1.2 Organization of the Paper 

Section 2 provides polynomial-time solutions for restricted cases of HRC and HDC based on integer programming, exhaustive search, and breadth-first search. In Section 3 we prove the NP-hardness of approximating HRC and HDC within any constant factor smaller than two. In the same section, we also prove that another type of approximation for HDC in terms of the number of clusters is NP-hard. Section 4 presents three approximation algorithms for HRC and HDC: a two-approximation algorithm for HRC and HDC based on Gonzalez' farthest-point clustering method Q, an approximation scheme, i.e., a $(1 + \epsilon)$-approximation algorithm for HRC, and a $(1 + \epsilon)$-approximation algorithm for HRC using a moderately larger number of approximative centers.



2 Polynomial-Time Solutions for Restricted Cases 

The Hamming radius p-clustering problem is equivalent to a special case of the integer programming problem. A given instance $(\alpha_1, .., \alpha_k, p, \varrho)$ of the decision version of HRC, where $\alpha_i \in Z_2^n$ for $1 \le i \le k$, and $p, \varrho \in N$, can be expressed as a system of $k \cdot p$ linear inequalities.

² Depending on the application, the difference between strings is sometimes measured
in terms of edit distance, which also takes insertions and deletions into account, 
rather than Hamming distance, which just considers substitutions. 
We use two matrices $X$ and $Y$ of 0-1 variables. The rows of $X$ correspond directly to the $p$ strings that constitute a p-center for the supplied instance, and $Y$ is used to make sure that each $\alpha_i$ is within distance $\varrho$ of at least one of the centers.

Let $X$ be a $p \times n$ matrix of variables $x_{jm} \in Z_2$, where $1 \le j \le p$ and $1 \le m \le n$. The value of $x_{jm}$ determines the value of the $m$-th position of the $j$-th center. Let $Y$ be a $k \times p$ matrix of variables $y_{ij} \in Z_2$, where $1 \le i \le k$ and $1 \le j \le p$. $y_{ij} = 1$ only if row $j$ of $X$ is a center string that is closest to $\alpha_i$, so that for each $i = 1, .., k$, we have $\sum_{j=1}^{p} y_{ij} = 1$. Next, for each $i = 1, .., k$ and $j = 1, .., p$, we have the inequality

$$\sum_{\substack{\alpha_i[m] = 0 \\ 1 \le m \le n}} x_{jm} + \sum_{\substack{\alpha_i[m] = 1 \\ 1 \le m \le n}} (1 - x_{jm}) \le \varrho + (1 - y_{ij}) \cdot D$$

where $D = \max_{1 \le j \le k} \left( \max_{1 \le i \le k} d(\alpha_i, \alpha_j) \right)$.

The above system of inequalities can be transformed to the form $Ax \le b$, where $A$ is a $(kp) \times (np + kp)$ integer matrix, $x$ is a variable vector over $Z_2^{np+kp}$, and $b$ is a vector in $Z^{kp}$. Note that the scalar product of any prefix of any row of $A$ with a 0-1 vector of the same length is neither less than $-n$ nor greater than $n + D$. In particular, when $p = 1$, such a product has its absolute value simply bounded by $n$. Now, we can solve the transformed system of $kp$ inequalities by a well-known dynamic programming procedure ^3, proceeding in stages. At the $j$th stage, we compute the set $S_j$ of all vectors that can be expressed as $\sum_{l=1}^{j} c_l z_l$ where $c_l$ is the $l$-th column of $A$ and $z_l \in Z_2$. Since the $S_j$ cannot be larger than $(2n + D + 1)^{kp}$ (or $(2n + 1)^{k}$ if $p = 1$), the whole procedure for a
fixed g takes 0((2n -|- D)^x>2^p [ np + kp)) time (or 0(n^2^^(n -|- k)) if p = 1). 
Hence, by using binary search to find the smallest possible $\varrho$, we conclude that HRC for $k = O(1)$ and $p = O(1)$ can be solved in polynomial time.

Theorem 1. HRC for instances with k strings of length n is solvable in 
time. 

On the other hand, if $n = O(\log k)$, exhaustive search yields a $k^{O(p)}$-time solution.

Theorem 2. HRC restricted to instances with $k$ strings of length $O(\log k)$ is solvable in $k^{O(p)}$ time.
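A minimal sketch of such an exhaustive search is given below, assuming binary strings and $p \le 2^n$. It enumerates all $p$-subsets of the $2^n$ candidate centers, which is $k^{O(p)}$ subsets when $n = O(\log k)$. The function name is hypothetical and no attempt is made at efficiency.

```python
from itertools import combinations, product
from typing import Sequence, Tuple


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def p_radius_exhaustive(S: Sequence[str], p: int) -> Tuple[int, Tuple[str, ...]]:
    """Return (p-radius, one optimal p-center set) by trying all center sets."""
    n = len(S[0])
    candidates = ["".join(bits) for bits in product("01", repeat=n)]
    best = None
    for centers in combinations(candidates, p):
        r = max(min(hamming(a, b) for b in centers) for a in S)
        if best is None or r < best[0]:
            best = (r, centers)
    return best


S = ["000", "011", "101", "110"]
print(p_radius_exhaustive(S, 1)[0])   # 2
print(p_radius_exhaustive(S, 2)[0])   # 1
```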

One of the main differences between HDC and HRC is that the former doesn't involve strings outside the input set $S$. For this reason it seems simpler to solve exactly than HRC does.³ For example, it has a simpler integer programming formulation involving only a single matrix of indicator variables. Furthermore, it can be solved by exhaustive search in $O(k^2 n + k^2 p^k)$ time, which immediately yields the following result.

³ Paradoxically, as for approximation in terms of the number of clusters it might be
more difficult, as is observed in the next sections. 
Theorem 3. HDC restricted to instances with $O(\log n)$ strings of length $n$ is
solvable in time. 

More interestingly, the Hamming diameter 2-clustering problem admits the 
following, rather straightforward polynomial-time solution. Let d be a candidate 
value for the maximum Hamming cluster diameter in an optimal 2-clustering 
of the k input strings of length n. Form a graph G with vertices in one-to-one 
correspondence with the input strings, and connect a pair of vertices by an edge 
whenever the Hamming distance between the corresponding strings is less than 
or equal to d. Now, the problem of Hamming diameter 2-clustering for the input 
strings becomes equivalent to that of partitioning the vertices of G into two 
cliques. The latter problem in turn reduces to 2-coloring the complement graph. 
By breadth-first search, we can find a 2-coloring of the complement graph, if one 
exists, in O(fc^) time. To find the smallest possible d, we use the procedure just 
described to test different values of d, generated by a binary search. Calculating 
all pairwise Hamming distances requires 0{k^n) time, but this can be done 
before starting the search for d. Hence, we obtain the following result. 

Theorem 4. For p = 2, HDC is solvable in $O(k^2 n)$ time.

Note that Theorem 4 can be generalized to any metric.
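The procedure behind Theorem 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `hamming`, `two_clustering` and `split` are hypothetical, all strings are assumed to have equal length, and the binary search runs over the distinct pairwise distances rather than over all values of d.

```python
from collections import deque
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def two_clustering(strings):
    """Return (d, clusters): the smallest diameter bound d and a 2-clustering."""
    k = len(strings)
    dist = [[hamming(a, b) for b in strings] for a in strings]  # O(k^2 n)

    def split(d):
        # 2-color the complement graph by BFS: i and j conflict if dist > d.
        color = [None] * k
        for start in range(k):
            if color[start] is not None:
                continue
            color[start] = 0
            queue = deque([start])
            while queue:
                u = queue.popleft()
                for v in range(k):
                    if v != u and dist[u][v] > d:
                        if color[v] is None:
                            color[v] = 1 - color[u]
                            queue.append(v)
                        elif color[v] == color[u]:
                            return None  # odd cycle: no partition into two cliques
        return color

    # Feasibility is monotone in d, so binary search over candidate values.
    candidates = sorted({dist[i][j] for i, j in combinations(range(k), 2)} | {0})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if split(candidates[mid]) is not None:
            hi = mid
        else:
            lo = mid + 1
    d = candidates[lo]
    color = split(d)
    clusters = [[s for s, c in zip(strings, color) if c == side] for side in (0, 1)]
    return d, clusters
```

Since only the distance matrix is used after the first step, replacing `hamming` by any other metric leaves the rest of the sketch unchanged.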

3 NP-Hardness of Approximating HRC and HDC 

By approximating HRC or HDC, we mean providing a polynomial-time algo- 
rithm yielding a p-center set or a p-clustering approximating the p-radius or 
the p-diameter, respectively. Our results from the first subsection prove the 
NP-hardness of this type of approximation of HRC and HDC. In the second 
subsection, we consider another kind of approximation of HDC relaxing the re- 
quirement on the number of produced clusters under the condition that their 
diameter doesn’t exceed the p-diameter; we show that it is NP-hard to approxi- 
mate the number of clusters within any reasonable factor. 



3.1 NP-Hardness of Approximating the p-Radius and p-Diameter 

To prove the hardness results in this subsection, we use the reduction described
in [?] from vertex cover for planar graphs of degree at most three to the
corresponding p-clustering problem in the plane under the $L_1$ metric. (The
radius p-clustering problem in the plane under the $L_1$ metric is the following:
For a finite set S of points in the plane, find a set P of p points in the plane
that minimizes $\max_{s \in S} \min_{u \in P} d_1(s, u)$, where $d_1$ is the $L_1$
distance. The diameter p-clustering problem in the plane under the $L_1$ metric
is defined correspondingly.)

By straightforward inspection of the aforementioned reduction from vertex
cover for planar graphs [?] and using, e.g., the planar graph drawing algorithm
from [?] in order to embed the input planar graph in the plane, we can ensure
that the points in the resulting instance of the p-clustering problem in the plane
as well as p points in an optimal solution lie on an integer grid of size polynomial
in the size of the input planar graph and $(a - 1)^{-1}$. This yields the following
technical strengthening of Theorem 2.1 from [?].

Lemma 5. Let a be a positive constant not less than 1. The radius p-clustering
and diameter p-clustering problems for a finite set S of points in the plane with
the $L_1$ metric, where the points in S lie on an integer grid of size polynomial
in the cardinality of S and $(a - 1)^{-1}$, and where the approximative solution to
the radius version is required to lie on the grid, are NP-hard to approximate
within a.

By using the idea of embedding the $L_1$ metric on an integer square grid into
the Hamming one, we obtain our main result in this section.

Theorem 6. HRC and HDC are NP-hard to approximate within any constant 
factor smaller than two. 

Proof. Let S be a set of points on an integer square grid of size $q(|S|)$, where
$q(\cdot)$ is a polynomial. Encode each grid point s with coordinates $s_x$ and $s_y$ by
the 0-1 string e(s) of length $2q(|S|)$ composed of $s_x$ consecutive 1's followed
by $q(|S|) - s_x$ consecutive 0's, next $s_y$ consecutive 1's, and finally $q(|S|) - s_y$
consecutive 0's. Note that for any two grid points s' and s'' their $L_1$ distance
is equal to the Hamming distance between their encodings e(s') and e(s''). This
observation immediately yields the thesis of the theorem for HDC by Lemma 5.

Consider an approximative solution $\alpha_1, \alpha_2, .., \alpha_p$ to the HRC problem for the
strings e(s), $s \in S$. For i = 1, .., p, we can transform $\alpha_i$ to $\alpha'_i$ having the form
$1^l 0^{q-l} 1^m 0^{q-m}$ for some $l, m \le q$ by moving all the 1's in the first half of
$\alpha_i$ to the appropriate prefix of $\alpha_i$, similarly moving the remaining 1's to the
appropriate prefix of the second half of $\alpha_i$, and filling the remaining positions
with the remaining 0's. Observe that the resulting string sequence
$\alpha'_1, .., \alpha'_p$ yields at least as good a solution as $\alpha_1, \alpha_2, .., \alpha_p$ for the
strings e(s), $s \in S$, by the special form of the e(s)'s. Also, it can be immediately
decoded into a sequence of grid points $g_1, g_2, .., g_p$ such that $\alpha'_i = e(g_i)$ for
i = 1, .., p. Putting everything together, we obtain the thesis of the theorem for
HRC by Lemma 5. □
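The encoding used in the proof can be checked on a small grid with the following Python sketch. The function names (`encode`, `normalize`, `decode`) and the grid side `q` are assumptions made for illustration only.

```python
def encode(point, q):
    """Unary-style encoding e(s) of a grid point s = (x, y) with 0 <= x, y <= q."""
    x, y = point
    return "1" * x + "0" * (q - x) + "1" * y + "0" * (q - y)

def hamming(a, b):
    return sum(c != d for c, d in zip(a, b))

def l1(s, t):
    return abs(s[0] - t[0]) + abs(s[1] - t[1])

def normalize(center, q):
    """Move the 1's of each half of an arbitrary center to the front of that half."""
    l, m = center[:q].count("1"), center[q:].count("1")
    return "1" * l + "0" * (q - l) + "1" * m + "0" * (q - m)

def decode(code, q):
    """Inverse of encode for strings of the form 1^l 0^(q-l) 1^m 0^(q-m)."""
    return code[:q].count("1"), code[q:].count("1")
```

For instance, `normalize("0101000111", 5)` gives a string of the special form that decodes to the grid point `(2, 3)`, and normalizing a center never increases its Hamming distance to any encoded grid point.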

3.2 NP-Hardness of Approximating HDC in Terms of the Number 
of Clusters 

Consider the following clique partition problem: Given an undirected graph G
and a natural number p, partition the set of vertices of G into pairwise disjoint
subsets $V_1, .., V_p$ such that for j = 1, .., p, the subgraph of G induced by $V_j$ is
a clique. Clearly, this problem is equivalent to coloring the complement graph
with p colors. It follows from known inapproximability results for graph coloring
[?] that for any $\epsilon > 0$, the problem of finding an approximative solution to the
clique partition problem consisting of $p n^{1/7-\epsilon}$ cliques, where n is the number
of vertices in the instance graph G, is NP-hard. For our purposes, it will be
convenient to assume that the instance graph is quasi-regular, by which we mean
that it satisfies the following two properties:

1. It contains two distinguished cliques. 

2. All vertices outside the two cliques have the same degree, which is not less 
than that of any vertex in the cliques. 

To achieve this, we can augment G with two cliques on n auxiliary vertices each.
Next, we connect each original vertex in G of degree q to $2n - q$ of the auxiliary
vertices; we equally distribute the new connections in a cyclic fashion so that each
vertex of the two cliques receives at most $\lceil (2n^2 - 2m)/2n \rceil$ connections to
the original vertices. Let G* be the resulting graph on 3n vertices. Note that all
original vertices have degree 2n and all vertices in the n-cliques have degree at
most 2n in G*. It is clear that if the vertices of G can be partitioned into l cliques
then the vertices of G* can be partitioned into at most l + 2 cliques. Conversely,
if the vertices of G* can be partitioned into l cliques then the vertices of G can
also be trivially partitioned into at most l cliques. Putting everything together,
we obtain the following technical lemma.

Lemma 7. For any $\epsilon > 0$, the clique partition problem restricted to quasi-regular
graphs cannot be approximated (in terms of the number of cliques) within
$n^{1/7-\epsilon}$ unless P=NP.

By a reduction from the clique partition problem for quasi-regular graphs to 
HDC, we obtain the following result. 

Theorem 8. For any $\epsilon > 0$, the problem of finding a partition of a set of k
binary strings of length $O(k^2)$ into at most $p k^{1/7-\epsilon}$ disjoint clusters such that
each cluster has Hamming diameter not exceeding the p-diameter is NP-hard.

Proof. Consider an instance of the restricted clique partition problem consisting
of a quasi-regular graph G on k vertices and m edges, and a natural number p.
Enumerate the edges of G. For each vertex v of G, form a string s(v) of length
m such that there is a 1 in the ith position of s(v) iff the ith edge of G is
incident to v. Let d be the maximum vertex degree of G. It follows that each
vertex in G outside the two distinguished cliques has degree d. Note that for any
pair of vertices $v_1, v_2$ in G of degree d, the Hamming distance between $s(v_1)$
and $s(v_2)$ is $2d - 2$ if they are adjacent; otherwise it is 2d. Also, for any pair
of vertices $v_1, v_2$ in the same distinguished clique of G, the Hamming distance
between $s(v_1)$ and $s(v_2)$ is at most $2d - 2$. Therefore, any clique p-partition of
G yields a p-clustering of the resulting strings of maximum Hamming diameter
less than or equal to $2d - 2$. Conversely, any q-clustering of the resulting strings
of maximum Hamming diameter less than or equal to $2d - 2$ trivially yields a
clique (q + 2)-partition of G. Hence, by Lemma 7 we obtain our result. □
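The edge-incidence strings s(v) from the proof are easy to generate. The sketch below is illustrative only (the function name `incidence_strings` is hypothetical) and uses a 4-cycle, a regular graph of degree d = 2, instead of a full quasi-regular instance.

```python
def incidence_strings(vertices, edges):
    """Map each vertex v to s(v): position i holds '1' iff the i-th edge touches v."""
    return {v: "".join("1" if v in e else "0" for e in edges) for v in vertices}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Small regular example: a 4-cycle, where every vertex has degree d = 2.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
s = incidence_strings(range(4), cycle_edges)
```

On this example adjacent vertices are at Hamming distance 2d - 2 = 2 and non-adjacent vertices at distance 2d = 4, which is the gap the reduction exploits.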

As for the corresponding problem for HRC (i.e., producing a larger set of
approximative centers such that each input string is within the p-radius from at
least one of the centers), we doubt whether it is equally hard to approximate.
At least, if we weaken the requirement of being within the p-radius by a
multiplicative factor of $1 + \epsilon$, then this problem admits a logarithmic
approximation in polynomial time, as shown at the end of the next section.

4 Approximation Algorithms for HRC and HDC 

In this section, we first observe how an approximation factor of two for HRC
and HDC can be achieved. Next, we provide an approximation scheme for HRC
running in polynomial time when $p = O(1)$ and $\rho = O(\log(k + n))$. Finally, we
give a relaxed type of arbitrarily close approximation of $\rho$, at the cost of a
moderate increase in the number of clusters, which runs in polynomial time
whenever $\rho = O(\log(k + n))$.

4.1 A 2-Approximation Algorithm for HRC and HDC

To obtain an approximation factor of two, we adapt Gonzalez' farthest-point
clustering algorithm [?] to HRC and HDC respectively as follows:

Algorithm A

STEP 1: Set P* to $\{\alpha_1\}$, where $\alpha_1$ is an arbitrary string in S.

STEP 2: For l = 2, .., p: augment P* by a string in S that maximizes the
minimum distance to P*, i.e., that is as far away as possible from the strings
already in P*.

STEP 3 (HRC): Return P*.

STEP 3 (HDC): Assign each string in S to a closest member in P* and return
the resulting clusters. □

The Hamming distance obeys the triangle inequality ([?], p. 424). Therefore,
by the proof of Theorem 8.14 in [?], Algorithm A yields an approximative solution
to either HRC or HDC that is always within a factor of two of the optimum. We
can implement this algorithm by updating the Hamming distance of each string
outside P* to the nearest string in P* after each augmentation of P*. To update
and then compute a string in S farthest from P* takes $O(kn)$ time in each
iteration. Hence, we obtain the following theorem.

Theorem 9. An approximative solution to either HRC or HDC that is always
within a factor of two of the optimum can be found in $O(pkn)$ time.
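Algorithm A, with the distance-updating implementation described above, can be sketched in Python as follows. The function names are hypothetical, the first input string plays the role of the arbitrary initial center, and ties are broken by input order.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def farthest_point_centers(strings, p):
    """Steps 1-2 and Step 3 (HRC): return (centers, radius) using O(pkn) work."""
    centers = [strings[0]]                               # STEP 1: arbitrary start
    nearest = [hamming(s, strings[0]) for s in strings]  # distance to nearest center
    for _ in range(1, p):                                # STEP 2
        far = max(range(len(strings)), key=nearest.__getitem__)
        centers.append(strings[far])
        nearest = [min(d, hamming(s, strings[far])) for d, s in zip(nearest, strings)]
    return centers, max(nearest)

def assign_clusters(strings, centers):
    """Step 3 (HDC): put each string into the cluster of a closest center."""
    clusters = [[] for _ in centers]
    for s in strings:
        j = min(range(len(centers)), key=lambda i: hamming(s, centers[i]))
        clusters[j].append(s)
    return clusters
```

Keeping the `nearest` list between iterations is what makes each augmentation cost O(kn) rather than recomputing all distances to all centers.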

4.2 An Approximation Scheme for HRC 

In this subsection we present a $(1 + \epsilon)$-approximation algorithm for HRC.
Our scheme is partly based on the idea used in the PTAS for 1-HRC in [?].

Algorithm B

STEP 1: Set C to an empty subset of $Z_2^n$. For each subset R of S having exactly
r strings, compute the set Q consisting of all positions m, $1 \le m \le n$, on which
all strings in R contain the same symbol. Set P to $\{1, 2, .., n\} \setminus Q$. For every
possible $f : P \to \{0, 1\}$, let $q_f$ be the string in $Z_2^n$ which agrees with the strings
in R on the positions in Q and contains f(j) in each position $j \in P$. Augment C
by $q_f$.

STEP 2: Consider the family of all subsets of the set C of size p. Test all sets
in this family and return the member P* that minimizes
$\max_{1 \le i \le k} \min_{c \in P^*} d_H(\alpha_i, c)$. □



The next lemma can be proved in the same way as Lemma 11 in [?] (the key
lemma for the PTAS for the Hamming radius 1-clustering problem) is proved in
the case of a radius of logarithmic or smaller size.

Lemma 10. For any subset U of S, there is a c in C such that

$$\max_{\alpha \in U} d_H(\alpha, c) \le \left(1 + \frac{1}{2r-1}\right) \min_{\beta \in Z_2^n} \max_{\alpha \in U} d_H(\alpha, \beta).$$



Theorem 11. Algorithm B constructs a p-center with the approximation factor
$1 + \frac{1}{2r-1}$ in $O(2^{pr\rho} k^{pr+1} n)$ time.

Proof. To prove the correctness and the approximation factor of Algorithm B,
consider an optimal p-center for S, say $\{\beta_1, .., \beta_p\}$. Partition S into subsets $U_1$
through $U_p$ such that for $1 \le j \le p$ and $\alpha \in U_j$, $\beta_j$ has minimum Hamming
distance to $\alpha$ among $\beta_1, .., \beta_p$. By Lemma 10, the family constructed in STEP 2
contains $\{\beta^*_1, .., \beta^*_p\}$ such that for $1 \le j \le p$ and any $\alpha \in U_j$, the Hamming
distance between $\alpha$ and $\beta^*_j$ is at most $1 + \frac{1}{2r-1}$ times the radius of $U_j$. Thus,
Algorithm B yields a solution within $1 + \frac{1}{2r-1}$ of the optimum.

To derive the upper bound on the running time of Algorithm B, first observe
that each of the sets P has size at most $r\rho$ and that a string $q_f$ can be constructed
in $O(nr)$ time. Hence, the size of the set C doesn't exceed $2^{r\rho} k^r$, and C can be
constructed in $O(r 2^{r\rho} k^r n)$ time. Consequently, the family constructed in STEP 2
is of size at most $2^{pr\rho} k^{pr}$, and its construction from C takes $O(2^{pr\rho} k^{pr} n)$ time.
All that remains is to note that the test of each p-tuple in the family can be
performed in $O(kn)$ time. □
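A direct, brute-force rendering of Algorithm B in Python may help to see how the candidate set C is built. It is a sketch under assumptions: the function names are hypothetical, `r` must not exceed the number of input strings, and the exhaustive loops are practical only for very small inputs.

```python
from itertools import combinations, product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def candidate_centers(strings, r):
    """STEP 1: build C from every r-subset R of the input strings."""
    n = len(strings[0])
    C = set()
    for R in combinations(strings, r):
        # P: positions on which the strings of R do not all agree.
        P = [m for m in range(n) if len({s[m] for s in R}) > 1]
        base = list(R[0])  # agrees with R on the positions in Q
        for f in product("01", repeat=len(P)):
            for pos, bit in zip(P, f):
                base[pos] = bit
            C.add("".join(base))
    return C

def best_centers(strings, p, r):
    """STEP 2: test all p-subsets of C and return (centers, radius)."""
    C = sorted(candidate_centers(strings, r))

    def radius(centers):
        return max(min(hamming(s, c) for c in centers) for s in strings)

    best = min(combinations(C, p), key=radius)
    return best, radius(best)
```

For example, with the four strings `000`, `011`, `101`, `110` and r = 2, every string of length 3 ends up in C, and the best single center found has radius 2.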

Note that the running time of Algorithm B is polynomial in n and k as long
as p is a constant and $\rho = O(\log(k + n))$.

Corollary 12. Algorithm B yields a polynomial-time approximation scheme for
the Hamming radius O(1)-clustering problem restricted to instances with the
p-radius in $O(\log(k + n))$.

4.3 A Relaxed Type of Approximation for HRC

In this subsection, we consider a twofold approximation for HRC that allows
producing more than p approximative centers and slightly exceeding the p-radius.

For each c in C (see Algorithm B), let S(c) be the set of all strings in S
within distance $(1 + \frac{1}{2r-1})\rho$ of c. By Lemma 10, there is a set consisting of p
such sets, covering all of S. If $\rho$ is known, we run the classical greedy heuristic
for minimum set cover (see [?]) on the instance $(S, \{S(c) \mid c \in C\})$ to find a set of
$O(p \log k)$ sets covering S. Otherwise, we perform a binary search for the smallest
possible value of $\rho \in \{0, 1, .., n\}$ in the definition of the sets S(c) by running the
aforementioned heuristic $O(\log n)$ times and each time testing whether or not
the resulting cover of S has size $O(p \log k)$. Recall that $|C| \le 2^{r\rho} k^r$ and that C
can be constructed in $O(r 2^{r\rho} k^r n)$ time. The instance of set cover corresponding
to a given value of $\rho$ can be constructed in $O(|C| kn)$ time; the greedy heuristic
can be implemented to run in $O(|C| k^2)$ time. By choosing r so that
$\frac{1}{2r-1} \le \epsilon$ and $r \le \frac{2}{\epsilon}$, we obtain the following result.

Theorem 13. For any constant $0 < \epsilon < 1$, we can construct a set L of $O(p \log k)$
strings of length n in $O((\frac{n}{\epsilon} + kn \log n + k^2 \log n)(2^{\rho} k)^{2/\epsilon})$ time such that
for each of the k strings in S there is at least one string in L within distance
$(1 + \epsilon)$ times the p-radius.

The time bound in Theorem 13 is polynomial in n and k as long as
$\rho = O(\log(k + n))$.
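The greedy set-cover step of this subsection can be sketched as follows. This is only an illustration: `greedy_cover` and its parameters are hypothetical names, `candidates` stands for the set C of Algorithm B given as a list, and `bound` stands for the relaxed radius used to define the sets S(c).

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_cover(strings, candidates, bound):
    """Greedily choose candidates until every string is within `bound` of one.

    Returns the chosen candidates, or None if `bound` is too small to cover S.
    """
    # S(c): indices of the input strings within distance `bound` of c.
    balls = {c: {i for i, s in enumerate(strings) if hamming(s, c) <= bound}
             for c in candidates}
    uncovered = set(range(len(strings)))
    chosen = []
    while uncovered:
        c = max(balls, key=lambda x: len(balls[x] & uncovered), default=None)
        if c is None or not balls[c] & uncovered:
            return None
        chosen.append(c)
        uncovered -= balls[c]
    return chosen
```

A binary search over `bound` then amounts to calling `greedy_cover` repeatedly and keeping the smallest value for which the returned cover is small enough.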

5 Conclusions 

We have shown not only that two is the best approximation factor for HRC and 
HDC achievable in polynomial time unless P=NP, but also that it is possible to 
provide exact solutions or much better approximation solutions to HRC or HDC 
in several special or relaxed cases. It seems that there are plenty of interesting 
open problems in the latter direction. For example, is it possible to design very 
close and efficient approximation algorithms for protein data (see Section 1.1) 
taking into account the specific distribution of the input?

Abstract. The Maximum Isomorphic Agreement Subtree (MIT) problem
is one of the simplest versions of the Maximum Interval Weight
Agreement Subtree method (MIWT), which is used to compare phylogenies.
More precisely, MIT provides a subset of the species such that the
exact distances between species in this subset are preserved
among all evolutionary trees considered. In this paper, the approximation
complexity of the MIT problem is investigated, showing that it cannot
be approximated in polynomial time within factor $\log^{\delta} n$ for any $\delta > 0$
unless NP ⊆ DTIME($2^{\mathrm{polylog}\, n}$) for instances containing three trees.
Moreover, we show that this result can be strengthened whenever
instances of the MIT problem can contain an arbitrary number of trees,
since MIT shares the same approximation lower bound as MAX CLIQUE.



1 Introduction 

Evolutionary trees are unordered trees where each leaf is labeled by a distinct
element in a set S of species and where all internal nodes have degree at least three.
Evolutionary trees are frequently used by biologists to represent classifications
of species; more precisely, extant species label the leaves, and each edge is weighted
with the estimated distance (i.e., temporal) between the two species represented
by its endpoints. A number of methods to infer evolutionary trees
have been proposed; moreover, it is rather common to study different biological
sequences or different sites of DNA, so that various trees for the same set
of species can be obtained. This fact motivates the compelling need to compare
different trees in order to extract a common history. The Maximum Agreement
Subtree method is a basic approach that allows one to reconcile different
evolutionary trees over the same set of species: it computes a subset of the extant
species about which all trees are confident or "agree". A general way to define
an agreement subtree from a set $T_1, \ldots, T_k$ of S-labeled trees has been
formalized in [?]. This method assumes that each edge is labeled by an interval weight
(a range of time to measure the duration of the evolution process) and looks for
a subset S' of the extant species S such that:

– each edge of the subtree induced in each of the given trees is labeled by
a value belonging to the given interval,

– for each pair of extant species in S', the total distance between them is the
same in all trees.

The problem stated above is called Maximum Interval Weight Agreement Subtree
(MIWT), and is a very general formulation of the problem of comparing
phylogenies. In order to obtain more efficient solutions, some restrictions have
been introduced to MIWT. A first natural restriction requires that each interval
contains one distinct value; this problem is called Maximum Weight Agreement
Subtree (MWT). A different restriction of MIWT is the one where an agreement
subtree is homeomorphic to a subtree of each tree in the instance; it is
equivalent to requiring all intervals to be of the form $[1, n - 1]$, where n is the
number of extant species considered. This problem is called Maximum
Homeomorphic Agreement Subtree (MHT). Note that this problem is sometimes referred
to as Maximum Agreement Subtree and is abbreviated MAST. A third
restriction of MIWT is the one where all intervals are of the form [1, 1], and is called
Maximum Isomorphic Agreement Subtree (MIT), as all subtrees induced by a
feasible solution must be isomorphic. The MIT problem is also a restricted case of
the maximum isomorphic subgraph problem, investigated in [?]. Since MIT and
MHT are the two most restricted problems among the ones we have mentioned,
most of the efforts of developing efficient algorithms have been concentrated on
them.

Efficient algorithms for the MHT problem for instances of two trees have
been widely investigated in the literature. While some heuristics have been known
[?], the first polynomial time algorithm was described only in 1993 by
Steel and Warnow [?]. Afterwards successive improvements have appeared in
the literature [?]. To our knowledge the most efficient algorithms for the problem
are due to Farach and Thorup, who developed an $O(n^{1.5} \log n)$ algorithm
for rooted trees of unbounded degree, to Cole and Hariharan, who gave an
$O(n \log n)$ algorithm for the case of rooted trees of bounded degree,
and to Kao, Lam, Przytycka, Sung and Ting, who described a technique
allowing one to match the time complexity of the two previously cited algorithms
also in the case of unrooted trees. The problems MHT and MIT over a set of
trees, where at least one of the trees has bounded degree, can be solved in
polynomial time [?], even though the time complexity is exponential in the bound
for the degree. Moreover both problems are NP-hard for instances containing
three trees of unbounded degree, hence it is necessary to study the possibility of
designing polynomial time approximation algorithms. The approximation
complexity of the MHT problem has been deeply investigated in [?], where some
strong negative results have been obtained. Since MIT is a simplified version
of MIWT, it seems natural to investigate whether the negative results for MHT hold
also for MIT or whether this problem is easier to approximate. In our paper we show
that the negative results of [?] hold also for the MIT problem, as a consequence
of a nontrivial application of the self-improvement technique; consequently the
search for polynomial time approximation algorithms achieving a constant error
ratio is NP-hard even for instances consisting of only three trees. Moreover we
have strengthened these negative results in the case of instances containing an
arbitrary number of trees, as we show that MIT shares the same inapproximability
properties as MAX CLIQUE, implying that there cannot exist a polynomial
time $n^{1-\epsilon}$ ratio approximation algorithm for each $\epsilon > 0$.



2 Preliminaries 



Let $S = \{s_1, \ldots, s_n\}$ be a set of labels. An S-labeled tree has n leaves, each one
labeled with a distinct element of S. Since each label unambiguously identifies
a leaf of the tree, in the rest of the paper we will write a label x to mean
the leaf of the tree with label x. The Maximum Isomorphic Agreement Subtree
Problem (MIT for short) is defined formally as follows:

Instance: a set $\mathcal{T} = \{T_1, \ldots, T_m\}$ of S-labeled trees.

Solution: an S*-labeled tree T*, with $S^* \subseteq S$, such that T* is isomorphic to a
subgraph of all trees in $\mathcal{T}$.

Measure: $|S^*|$, to be maximized.

All trees we will deal with in this paper are rooted, that is, we distinguish a
special vertex of the tree T and we call this vertex the root, denoted by r(T). All
results presented in the paper are stated for rooted trees, but can be generalized
to the unrooted case.

Let T be a tree and let a, b be two nodes of T; then we will denote by $d_T(a, b)$
the distance between a and b in T, that is, the number of edges in the unique
simple path from a to b in T. Let T be a rooted tree, and let t be a node of T;
then the depth of t in T is the distance of t from the root of T. The depth of
a tree T, denoted by depth(T), is the maximum among the depths of its nodes.
Given two leaves a, b of T we define the least common ancestor of a and b in
T, denoted by $lca_T(a, b)$, as the maximum depth node of T which is an ancestor of
both a and b.

It is immediate to note that the NP-completeness proof given by Amir and
Keselman in [?] is an L-reduction, as pointed out in [?] for the MHT problem.
Similarly it is possible to prove that MIT is MAX SNP-hard, that is, there is
no polynomial time approximation scheme for it, unless P = NP.

However, differently from [?], to prove our inapproximability results this
MAX SNP-hardness proof is not sufficient, and we have to deal with a restricted
version of the problem; more precisely we consider only instances consisting of
trees having leaves all at the same depth in every tree. Formally,
$d_{T_i}(a, r(T_i)) = d_{T_j}(b, r(T_j))$ for all $a, b \in S$ and every pair of trees $T_i, T_j$
in the instance. We will say that trees in such instances are restricted. This new
problem will be called R-MIT. Clearly all inapproximability results for that
problem hold also for MIT.

The following Lemma, proved in [?], characterizes all feasible solutions of
each instance of R-MIT.

Lemma 2.1. Let $\mathcal{T}$ be a set of S-labeled trees, and let $S^* \subseteq S$. Then there exists
an S*-labeled tree T* that is isomorphic to a subgraph of each tree in $\mathcal{T}$ iff for
each pair of labels $a, b \in S^*$, a and b have the same distance in all trees in $\mathcal{T}$.

As a consequence we can identify a feasible solution of an instance of MIT as
a subset of its label set. The following property of trees, whose straightforward
proof is omitted, will be used in the rest of the paper.

Proposition 2.2. Let a, b be two leaves of an S-labeled tree T with root r. Then
$d_T(a, b) = d_T(a, r) + d_T(b, r) - 2 d_T(r, lca_T(a, b))$.
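Proposition 2.2 and the feasibility test of Lemma 2.1 translate directly into code. In the sketch below a rooted tree is assumed to be given as a parent map (each node mapped to its parent, the root mapped to `None`); this representation and all function names are choices made for illustration.

```python
from itertools import combinations

def ancestors(parent, v):
    """Path from v up to the root, v included."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def depth(parent, v):
    return len(ancestors(parent, v)) - 1

def lca(parent, a, b):
    """Deepest node that is an ancestor of both a and b."""
    seen = set(ancestors(parent, a))
    return next(v for v in ancestors(parent, b) if v in seen)

def distance(parent, a, b):
    """Proposition 2.2: d(a, b) = d(a, r) + d(b, r) - 2 d(r, lca(a, b))."""
    return depth(parent, a) + depth(parent, b) - 2 * depth(parent, lca(parent, a, b))

def is_feasible(trees, labels):
    """Lemma 2.1: every pair of labels has the same distance in all trees."""
    return all(len({distance(t, a, b) for t in trees}) == 1
               for a, b in combinations(labels, 2))
```

With this test a feasible solution is just a set of labels, so candidate solutions can be compared by size without building the agreement subtree itself.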



3 R-MIT Is MAX SNP-Hard 



In this section we are going to prove that the R-MIT problem is MAX SNP-hard.
This result is necessary to prove that MIT is hard to approximate even
on instances consisting of only three trees.

The problem used in the L-reduction to R-MIT is the Three-Dimensional
Bounded Matching (3DM-B for short). An instance of 3DM-B consists of three
pairwise disjoint sets $X_1, X_2, X_3$ and a set M of triples, where $M \subseteq X_1 \times X_2 \times X_3$
and every element in $X_1 \cup X_2 \cup X_3$ occurs in at least one and at most B triples
of M. The goal is to find a maximum cardinality subset $M_1$ of M such that no
two triples in $M_1$ agree in any coordinate. The general 3DM-B problem is MAX
SNP-hard [?].

Let $\mathcal{M} = \langle X_1, X_2, X_3, M \rangle$ be an instance of the 3DM-B problem, with
$M \subseteq X_1 \times X_2 \times X_3$ and $X_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,|X_i|}\}$. Then we will associate to
$\mathcal{M}$ an instance $\langle T_1, T_2, T_3 \rangle$ of MIT. Each tree $T_i$ consists of the following nodes
and edges: a root labeled $r_i$, a node connected to the root for each element of $X_i$,
and finally, for each element $x_{i,j}$, a node for each triple in M containing
$x_{i,j}$, adjacent to the node associated to $x_{i,j}$. Then each tree $T_i$
is M-labeled.



Fig. 1. Example of instance of R-MIT associated to an instance of 3DM-B

Since the distance from each leaf to the root is 2 in all trees of the instance
of MIT associated to an instance of 3DM-B, such a set of trees is an instance of
R-MIT. The following Lemma is an immediate consequence of this fact.

Lemma 3.1. Let $\mathcal{M} = \langle X_1, X_2, X_3, M \rangle$ be an instance of 3DM-B and let
$\langle T_1, T_2, T_3 \rangle$ be the associated instance of MIT. Given a tree $T_i$ with $1 \le i \le 3$,
and given two distinct leaves s, t of $T_i$, the distance of s and t in $T_i$ is 2 or 4.

Note that the distance of two leaves s and t in a tree $T_i$ is 2 if and only if s
and t are labeled by triples of M that share the same element in the set $X_i$.

Lemma 3.2. Let $\mathcal{M} = \langle X_1, X_2, X_3, M \rangle$ be an instance of 3DM-B, let
$\langle T_1, T_2, T_3 \rangle$ be the associated instance of R-MIT and let $S \subseteq M$. Then S
is a feasible solution of $\langle T_1, T_2, T_3 \rangle$ iff each pair s, t of distinct triples in S
has distance 4 in all trees $T_i$.

Proof. By Lemma 2.1, S is a feasible solution iff each pair s, t of distinct triples
in S has the same distance in all trees $T_i$, which, by Lemma 3.1, is either 2 or
4. Assume to the contrary that there exists a pair s, t that has distance 2 in all
trees. Then by construction s is equal to t, contradicting the fact that all triples in
M are distinct. The other direction follows immediately by Lemma 2.1.

Theorem 3.3. The reduction from 3DM-B to R-MIT is an L-reduction.

Proof. Follows from Lemmas 3.1 and 3.2.
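The construction of this section is small enough to be checked exhaustively on a toy instance. The Python sketch below is illustrative: the node encodings `("root", i)` and `("elem", i, x)` and the function names are assumptions, and triples are plain tuples of three element names.

```python
from itertools import combinations

def build_trees(triples):
    """Return three parent maps; in tree i the leaf t hangs below element t[i]."""
    trees = []
    for i in range(3):
        parent = {("root", i): None}
        for t in triples:
            inner = ("elem", i, t[i])
            parent[inner] = ("root", i)
            parent[t] = inner          # leaf labeled by the triple itself
        trees.append(parent)
    return trees

def leaf_distance(parent, s, t):
    """Lemma 3.1: two distinct leaves are at distance 2 (same parent) or 4."""
    return 2 if parent[s] == parent[t] else 4

def is_feasible(trees, subset):
    return all(len({leaf_distance(p, s, t) for p in trees}) == 1
               for s, t in combinations(subset, 2))

def is_matching(subset):
    """No two triples agree in any coordinate."""
    return all(s[i] != t[i] for s, t in combinations(subset, 2) for i in range(3))
```

On any instance with pairwise distinct triples, the subsets accepted by `is_feasible` are exactly those accepted by `is_matching`, which is the content of Lemma 3.2.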

4 Product of Trees 

The inapproximability result over instances of three trees is obtained by means
of the self-improvement technique. In [?] this technique has been exploited to
prove a similar result for the MHT problem. The technique requires a careful
definition of product between instances of the problem, defined as follows:

Definition 4.1. Let $T_1$ be an $S_1$-labeled tree, let $T_2$ be an $S_2$-labeled tree and let s
be a leaf of $T_1$; then the tree $T_{2,s}$ is the tree obtained from $T_2$ by relabeling each leaf
$s_2$ with the sequence $s s_2$. The product $T_1 \cdot T_2$ is the tree obtained from $T_1$ by
replacing each leaf s with the tree $T_{2,s}$.

Let T be an S-labeled tree; then $T^2 = T \cdot T$ and $T^i = T^{i-1} \cdot T$ for $i > 2$. Note
that the label of a leaf of the tree $T^k$ is a string $s_1 \ldots s_k$ of k symbols over the
alphabet S.

An immediate property of the product of trees is stated below:

Proposition 4.1. Let $T_1$, $T_2$ be two restricted trees. Then $T_1 \cdot T_2$ is also a
restricted tree.

Fig. 2. Product of trees
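The product of Definition 4.1 can be sketched with nested Python lists. The representation is an assumption made for illustration: an internal node is a list of children, and a leaf is a tuple holding the sequence of symbols that labels it; the function names are hypothetical.

```python
def relabel(tree, prefix):
    """T_{2,s}: prepend the label sequence `prefix` to every leaf label of tree."""
    if isinstance(tree, tuple):          # a leaf
        return prefix + tree
    return [relabel(child, prefix) for child in tree]

def product(t1, t2):
    """T1 . T2: replace each leaf s of t1 by the relabeled copy of t2."""
    if isinstance(t1, tuple):
        return relabel(t2, t1)
    return [product(child, t2) for child in t1]

def power(tree, k):
    """T^k, computed as T^(k-1) . T."""
    result = tree
    for _ in range(k - 1):
        result = product(result, tree)
    return result

def leaf_depths(tree, depth=0):
    """Map each leaf label to its depth."""
    if isinstance(tree, tuple):
        return {tree: depth}
    depths = {}
    for child in tree:
        depths.update(leaf_depths(child, depth + 1))
    return depths
```

For two restricted trees of depth 2, every leaf of the product lies at depth 4, in line with Proposition 4.1, and the leaves of `power(tree, k)` are labeled by sequences of k symbols.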



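The following Lemma points out the motivation for our definition of product.

Lemma 4.2. Let $T_1, T_2$ be two restricted S-labeled trees, let a, b be two labels
in S and let $\alpha, \beta$ be two strings of $k - 1$ symbols of S. Then
$d_{T_1^k}(\alpha a, \beta b) = d_{T_2^k}(\alpha a, \beta b)$ iff $d_{T_1}(a, b) = d_{T_2}(a, b)$ and
$d_{T_1^{k-1}}(\alpha, \beta) = d_{T_2^{k-1}}(\alpha, \beta)$.

Proof. Note that Prop. 2.2, together with the fact that $T_1$ and $T_2$ are
restricted trees (that is, in $T_1$ and $T_2$ all leaves have the same depth), implies that it
is sufficient to prove that
$d_{T_1^k}(lca_{T_1^k}(\alpha a, \beta b), r(T_1^k)) = d_{T_2^k}(lca_{T_2^k}(\alpha a, \beta b), r(T_2^k))$
iff $d_{T_1}(lca_{T_1}(a, b), r(T_1)) = d_{T_2}(lca_{T_2}(a, b), r(T_2))$ and
$d_{T_1^{k-1}}(lca_{T_1^{k-1}}(\alpha, \beta), r(T_1^{k-1})) = d_{T_2^{k-1}}(lca_{T_2^{k-1}}(\alpha, \beta), r(T_2^{k-1}))$.

Initially let us consider the case $\alpha = \beta$; then, by definition of product,
$d_{T_1^k}(lca_{T_1^k}(\alpha a, \beta b), r(T_1^k)) = d_{T_2^k}(lca_{T_2^k}(\alpha a, \beta b), r(T_2^k))$ if and only if
$d_{T_1}(lca_{T_1}(a, b), r(T_1)) = d_{T_2}(lca_{T_2}(a, b), r(T_2))$. Assume now that $\alpha \ne \beta$; then
$lca_{T_1^k}(\alpha a, \beta b) = lca_{T_1^{k-1}}(\alpha, \beta)$ and $lca_{T_2^k}(\alpha a, \beta b) = lca_{T_2^{k-1}}(\alpha, \beta)$.
Consequently
$d_{T_1^k}(lca_{T_1^k}(\alpha a, \beta b), r(T_1^k)) = d_{T_2^k}(lca_{T_2^k}(\alpha a, \beta b), r(T_2^k))$ if and only if
$d_{T_1^{k-1}}(lca_{T_1^{k-1}}(\alpha, \beta), r(T_1^{k-1})) = d_{T_2^{k-1}}(lca_{T_2^{k-1}}(\alpha, \beta), r(T_2^{k-1}))$.
The two cases together prove the Lemma.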

The following lemma relates a feasible solution of $\langle T_1, T_2, T_3 \rangle$ with a
feasible solution of $\langle T_1^k, T_2^k, T_3^k \rangle$.

Lemma 4.3. Let $\langle T_1, T_2, T_3 \rangle$ be an instance of R-MIT, and let F be a feasible
solution of such instance. Then it is possible to compute in polynomial time a
solution of $\langle T_1^k, T_2^k, T_3^k \rangle$ whose cost is $cost(F)^k$.

Proof. Let $F_k$ be the set of strings of labels $\{f_1 \cdots f_k : f_i \in F, 1 \le i \le k\}$.
We will prove that for each pair of strings of labels
$f_{\alpha_1} \cdots f_{\alpha_k}, f_{\beta_1} \cdots f_{\beta_k} \in F_k$,
their distance is the same in all trees $T_1^k, T_2^k, T_3^k$. Assume to the contrary, and
let $k_1$ be the minimum integer such that $F_{k_1}$ is not a solution of
$\langle T_1^{k_1}, T_2^{k_1}, T_3^{k_1} \rangle$; w.l.o.g. we can assume that
$f_{\alpha_1} \cdots f_{\alpha_{k_1}}, f_{\beta_1} \cdots f_{\beta_{k_1}}$ do not have the same distance in $T_1^{k_1}$ and
in $T_2^{k_1}$. Note that $k_1 > 1$, as
$d_{T_1}(f_{\alpha_i}, f_{\beta_i}) = d_{T_2}(f_{\alpha_i}, f_{\beta_i}) = d_{T_3}(f_{\alpha_i}, f_{\beta_i})$ for all i, since all
$f_{\alpha_i}, f_{\beta_i}$ are in F. But Lemma 4.2 implies that
$d_{T_1}(f_{\alpha_{k_1}}, f_{\beta_{k_1}}) \ne d_{T_2}(f_{\alpha_{k_1}}, f_{\beta_{k_1}})$ or
$d_{T_1^{k_1-1}}(f_{\alpha_1} \cdots f_{\alpha_{k_1-1}}, f_{\beta_1} \cdots f_{\beta_{k_1-1}}) \ne d_{T_2^{k_1-1}}(f_{\alpha_1} \cdots f_{\alpha_{k_1-1}}, f_{\beta_1} \cdots f_{\beta_{k_1-1}})$,
contradicting the minimality of $k_1$.

Theorem 4.4. Let $\langle T_1^k, T_2^k, T_3^k \rangle$ be an instance of R-MIT and let $S_k$ be a
feasible solution of $\langle T_1^k, T_2^k, T_3^k \rangle$; then it is possible to compute in polynomial
time a feasible solution $S_1$ of $\langle T_1, T_2, T_3 \rangle$ such that $cost(S_k) \le cost(S_1)^k$.

Proof. Let $S_k$ be a feasible solution of $\langle T_1^k, T_2^k, T_3^k \rangle$; each of its elements is a
string of k symbols of S. By applying Lemma 4.2 iteratively we can obtain k
feasible solutions $F_i$ of $\langle T_1, T_2, T_3 \rangle$, where each solution $F_i$ contains exactly the
symbols of S that are in the i-th position of a string in $S_k$. Let F* be the largest
of such $F_i$ and let $F^*_k$ be the set of strings $\{f_1 \ldots f_k : f_j \in F^*, 1 \le j \le k\}$. Just
as in the proof of Lemma 4.3 it is possible to prove that $F^*_k$ is a feasible solution of
$\langle T_1^k, T_2^k, T_3^k \rangle$. An immediate counting argument and the fact that F* is the
$F_i$ of maximum cardinality imply that $|S_k| \le |F^*_k| = |F^*|^k$.

We are now able to state our main results: 

Theorem 4.5. There does not exist a constant-ratio polynomial-time
approximation algorithm for R-MIT unless NP=P.

Proof. Assume to the contrary that there exists an $\epsilon$-approximation
polynomial-time algorithm for R-MIT. Then let $\alpha > 1$, and pose $k = \lceil \log_\alpha \epsilon \rceil$;
consequently $\alpha^k \ge \epsilon$. Since there exists an $\epsilon$-approximation algorithm for
R-MIT, let $Apx(\langle T_1, T_2, T_3 \rangle)$ be the solution returned by such algorithm for the
instance $\langle T_1, T_2, T_3 \rangle$, while $Opt(\langle T_1, T_2, T_3 \rangle)$ denotes the optimum solution.
Then, by Lemma 4.3 and Theorem 4.4,

$$\left(\frac{Opt(\langle T_1, T_2, T_3 \rangle)}{Apx(\langle T_1, T_2, T_3 \rangle)}\right)^k \le \frac{Opt(\langle T_1^k, T_2^k, T_3^k \rangle)}{Apx(\langle T_1^k, T_2^k, T_3^k \rangle)} \le \epsilon \le \alpha^k,$$

hence $\frac{Opt(\langle T_1, T_2, T_3 \rangle)}{Apx(\langle T_1, T_2, T_3 \rangle)} \le \alpha$. Note that computing
$\langle T_1^k, T_2^k, T_3^k \rangle$ from $\langle T_1, T_2, T_3 \rangle$ can be done in $O(n^{\lceil \log_\alpha \epsilon \rceil})$ time, hence we
have described a PTAS for R-MIT. By Theorem 3.3, NP=P.



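Corollary 4.6. There exists a constant $\delta > 0$ such that R-MIT cannot be
approximated within factor $\log^{\delta} n$ in polynomial time, unless
NP ⊆ DTIME[$2^{\mathrm{polylog}\, n}$].

Proof. Assume to the contrary that for all $\delta > 0$ there exists a
$\log^{\delta} n$-approximation polynomial-time algorithm for R-MIT. Then let $\alpha > 1$, and pose
$k = \lceil \log_\alpha \log^{\delta} n \rceil$; consequently $\alpha^k \ge \log^{\delta} n$. Just as in the proof of
Theorem 4.5 we will denote with $Apx(\langle T_1, T_2, T_3 \rangle)$ the solution returned by the
approximation algorithm for the instance $\langle T_1, T_2, T_3 \rangle$, while
$Opt(\langle T_1, T_2, T_3 \rangle)$ denotes the optimum solution. Then, by Lemma 4.3 and
Theorem 4.4,

$$\left(\frac{Opt(\langle T_1, T_2, T_3 \rangle)}{Apx(\langle T_1, T_2, T_3 \rangle)}\right)^k \le \frac{Opt(\langle T_1^k, T_2^k, T_3^k \rangle)}{Apx(\langle T_1^k, T_2^k, T_3^k \rangle)} \le \log^{\delta} n,$$

taking the logarithms of both sides

$$k \log\left(\frac{Opt(\langle T_1, T_2, T_3 \rangle)}{Apx(\langle T_1, T_2, T_3 \rangle)}\right) \le \log(\log^{\delta} n).$$

Consequently

$$\lceil \log_\alpha \log^{\delta} n \rceil \log\left(\frac{Opt(\langle T_1, T_2, T_3 \rangle)}{Apx(\langle T_1, T_2, T_3 \rangle)}\right) \le \log(\log^{\delta} n),$$

implying that $\log_\alpha\left(\frac{Opt(\langle T_1, T_2, T_3 \rangle)}{Apx(\langle T_1, T_2, T_3 \rangle)}\right) \le 1$. Hence
$\frac{Opt(\langle T_1, T_2, T_3 \rangle)}{Apx(\langle T_1, T_2, T_3 \rangle)} \le \alpha$. It is immediate to note that computing
$\langle T_1^k, T_2^k, T_3^k \rangle$ from $\langle T_1, T_2, T_3 \rangle$ can be done in
$O(n^{\lceil \log_\alpha \log^{\delta} n \rceil}) = 2^{\mathrm{polylog}\, n}$ time. Thus the claim follows from Thm. 3.3.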



5 Inapproximability over Unbounded Number of Trees 

The inapproximability result presented in the previous section can be strengthened
when instances are not required to contain exactly three trees, but can contain
an arbitrary number of trees. This can be proved by a simple L-reduction
from MAX CLIQUE. Since this reduction preserves the optimum and the cost of
approximate solutions, MIT with an unbounded number of trees inherits the
inapproximability results of MAX CLIQUE, that is, it cannot be approximated
within $n^{1-\epsilon}$ for each $\epsilon > 0$, unless ZPP = NP. An instance of MAX CLIQUE
is an undirected graph $G = \langle V, E \rangle$, and a feasible solution is a clique of $G$,
that is, a subset $C \subseteq V$ such that $(c_1, c_2) \in E$ for each pair $c_1, c_2$ of vertices in
$C$. The goal is to maximize $|C|$.

The reduction is quite simple: let G = {V, E) be a graph with E ^ The 
instance of MIT contains the $V$-labeled trees in the set $\{T_{edge}\} \cup \{T_{ij} : i, j \in V, (i, j) \notin E\}$,
where $T_{edge}$ has root $r$, each leaf $v$ of $T_{edge}$ has $p_v$ as its parent,
and $p_v$ is a child of $r$. Each tree $T_{ij}$ consists of a root $r$, a node $p_{ij}$ that is the
parent of both leaves $v_i, v_j$, and a node $p_z$ for each $z \in V - \{v_i, v_j\}$, where each $p_z$ is
the parent of the leaf $v_z$. Moreover, $p_{ij}$ and all $p_z$ with $z \in V - \{i, j\}$ are the
children of the root.
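
The construction can be checked on small graphs. The following is a minimal sketch, not part of the original paper: trees are stored as parent maps, and the node names (`"r"`, `("p", v)`, `("leaf", v)`) as well as all function names are illustrative assumptions. It builds $T_{edge}$ and one $T_{ij}$ per non-edge, and compares "all pairwise leaf distances equal four in every tree" with "the vertex subset is a clique".

```python
from itertools import combinations

def build_t_edge(vertices):
    """Parent map of T_edge: root r, one node p_v per vertex, leaf v below p_v."""
    parent = {"r": None}
    for v in vertices:
        parent[("p", v)] = "r"
        parent[("leaf", v)] = ("p", v)
    return parent

def build_t_ij(vertices, i, j):
    """Parent map of T_ij: leaves i and j share the parent p_ij;
    every other leaf z hangs below its own node p_z."""
    parent = {"r": None, ("p", i, j): "r"}
    parent[("leaf", i)] = ("p", i, j)
    parent[("leaf", j)] = ("p", i, j)
    for z in vertices:
        if z not in (i, j):
            parent[("p", z)] = "r"
            parent[("leaf", z)] = ("p", z)
    return parent

def leaf_distance(parent, a, b):
    """Number of edges on the path between the leaves labeled a and b."""
    def path_to_root(x):
        path = []
        while x is not None:
            path.append(x)
            x = parent[x]
        return path
    pa, pb = path_to_root(("leaf", a)), path_to_root(("leaf", b))
    depth_in_pa = {node: k for k, node in enumerate(pa)}
    for k, node in enumerate(pb):
        if node in depth_in_pa:          # first common ancestor
            return k + depth_in_pa[node]
    raise ValueError("leaves are not in the same tree")

def reduction(vertices, edges):
    """Return [T_edge] plus one T_ij for every non-edge {i, j}."""
    edge_set = {frozenset(e) for e in edges}
    trees = [build_t_edge(vertices)]
    for i, j in combinations(vertices, 2):
        if frozenset((i, j)) not in edge_set:
            trees.append(build_t_ij(vertices, i, j))
    return trees

def all_distances_four(trees, subset):
    """True if every pair of leaves in subset is at distance 4 in every tree."""
    return all(leaf_distance(t, a, b) == 4
               for t in trees for a, b in combinations(subset, 2))

def is_clique(edges, subset):
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(p) in edge_set for p in combinations(subset, 2))
```

On a small graph, the two predicates `all_distances_four` and `is_clique` can be compared over all vertex subsets; this only illustrates the distance argument used in the text and is not a proof of the reduction.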

The following lemma points out the structure of all feasible solutions considered
in our reduction.

Lemma 5.1. Let $\mathcal{T}$ be the set of $V$-labeled trees associated to an instance $G = \langle V, E \rangle$
of MAX CLIQUE, and let $T$ be a feasible solution of MIT($\mathcal{T}$). Let $v_1, v_2$
be two distinct leaves of $T$. Then the distance between $v_1$ and $v_2$ in $T$ is four.

Proof. Let vi, V2 be two distinct leaves of T. Since vi and V2 are both in a 
feasible solution T of T, by Lemma ^Jtheir distance must be the same in all 
trees in $\mathcal{T}$. Since $d_{T_{edge}}(v_1, v_2) = 4$, it follows that $d_T(v_1, v_2) = 4$.

We will show how to compute a feasible solution of MIT from a feasible 
solution of MAX CLIQUE and vice versa, so that the costs of the solution are the 
same. 

Let T be the instance of MIT, associated to the instance G =< V,E > of 
MAX CLIQUE, and let V\ G V he a, feasible solution of T. Please note that, by 
Lemmas^Hand^Ja subset Vi C 1/ is a feasible solution of T iff dxivi, V 2 ) = 4 
for each pair of distinct elements fi, Vj € Vi and each tree T G T. We will prove 
that $V_1$ is a clique of $G$. Assume to the contrary that $V_1$ is not a clique of $G$, that
is, there exist two vertices $v_i, v_j \in V_1$ such that $(v_i, v_j) \notin E$. By construction, in
T there is the tree Tij, and dTi^ivi, Vj) = 2. Consequently by Lemma^Jrii and 
Vj cannot both be in a feasible solution of T. To compute a feasible solution of 
T from a clique of G is trivial, hence the following theorem follows: 

Theorem 5.2. MIT over an unbounded number of trees cannot be approximated
within $n^{1-\epsilon}$ for each $\epsilon > 0$, unless ZPP = NP.

6 Conclusions 

The MIT problem is one of the simplest formulations of evolutionary tree
comparison proposed in the literature, while the most studied of such formulations is
the MHT problem. In this paper we have shown that MIT shares the same
inapproximability bounds as MHT whenever the instances are restricted to contain
exactly 3 trees, while it inherits the same bounds as MAX CLIQUE when the
instances are unrestricted.

Abstract. A widely used method for determining the similarity of two
labeled trees is to compute a maximum agreement subtree of the two
trees. Previous work on this similarity measure is concerned only with the
comparison of labeled trees of two special kinds, namely, uniformly
labeled trees (i.e., trees with all their nodes labeled by the same symbol)
and evolutionary trees (i.e., leaf-labeled trees with distinct symbols for
distinct leaves). This paper presents an algorithm for comparing trees
that are labeled in an arbitrary manner. In addition to this generalization,
our algorithm is faster than the previous algorithms in many cases.



1 Introduction 



A labeled tree is a rooted tree with an arbitrary subset of nodes being labeled
with symbols. Labeled trees are used to model the relationships among objects in
real-life systems. In recent years, many algorithms for comparing such trees
have been developed for diverse application areas including biology,
chemistry, linguistics, computer vision, pattern recognition,
and structured text databases.

A widely used measure of the similarity of two labeled trees is the notion of
a maximum agreement subtree. A labeled tree $R$ is said to be a label-preserving
homeomorphic subtree of another labeled tree $T$ if there exists a one-to-one
mapping $f$ from the nodes of $R$ to the nodes of $T$ such that for any nodes $u, v, w$
of $R$, (1) $u$ and $f(u)$ have the same label; and (2) $w$ is the least common ancestor
of $u$ and $v$ if and only if $f(w)$ is the least common ancestor of $f(u)$ and $f(v)$.
Let $T_1$ and $T_2$ be two labeled trees. An agreement subtree of $T_1$ and $T_2$ is a
labeled tree which is also a label-preserving homeomorphic subtree of the two
trees. A maximum agreement subtree is one which maximizes the number of
labeled nodes.

In some applications, each symbol $z$ may be associated with a positive integer
weight $\mu(z)$ to indicate its relative significance. We generalize
the definition of a maximum agreement subtree of $T_1$ and $T_2$ to be one which
maximizes the total weight of labeled nodes; let MAST($T_1, T_2$) denote this
maximum total weight. Let $n = |T_1| + |T_2|$, i.e., the number of nodes in $T_1$ and $T_2$.
Let $d$ be the maximum degree of $T_1$ and $T_2$. Let $N$ be the maximum $\mu(z)$ over
all the symbols $z$. Note that when $N = 1$, $\mu(z) = 1$ for all symbols $z$ and the
new definition of a maximum agreement subtree is the same as the original one.

In the literature, many algorithms for computing MAST($T_1, T_2$) have been
developed. These algorithms focus on the special cases where $N = 1$ and $T_1$
and $T_2$ are (1) evolutionary trees, i.e., leaf-labeled trees with distinct symbols
for distinct leaves, or (2) uniformly labeled trees, i.e., trees with all their nodes
unlabeled or labeled with the same symbol.

For evolutionary trees, Steel and Warnow gave the first polynomial-time
algorithm, which runs in $O(n^{4.5} \log n)$ time. Farach and Thorup reduced the
time complexity to $O(n^{1.5} \log n)$. Recently, Kao et al. further improved the
time complexity to $O(n^{1.5})$ with a breakthrough for a long-standing open
problem on maximum bipartite matchings. Faster algorithms for the case $d = O(1)$
have also been discovered recently. The algorithm of Farach, Przytycka and
Thorup runs in $O(\sqrt{d}\,n \log^3 n)$ time, and that of Kao takes $O(n d^2 \log^2 n \log d)$
time. Cole and Hariharan gave an $O(n \log n)$-time algorithm for the case
where $T_1$ and $T_2$ are binary trees. Przytycka removed the degree-2
restriction with an $O(\sqrt{d}\,n \log n)$-time algorithm.

For uniformly labeled trees $T_1$ and $T_2$, MAST($T_1, T_2$) requires more time
to compute. Chung gave an algorithm to determine whether $T_1$ is a
label-preserving homeomorphic subtree of $T_2$ using $O(n^{2.5})$ time. This algorithm
provides a tool for verifying whether a uniformly labeled tree $T_3$ is an agreement
subtree of $T_1$ and $T_2$ in $O(m^{2.5})$ time, where $m = |T_1| + |T_2| + |T_3|$. Gupta and
Nishimura gave an algorithm which actually computes a maximum agreement
subtree of $T_1$ and $T_2$ in $O(n^{2.5} \log n)$ time.

Instead of solving special cases, this paper gives an algorithm to compute
MAST($T_1, T_2$) where $T_1$ and $T_2$ are unrestricted (i.e., labels are not
restricted to leaves and need not be distinct). The generality of our algorithm
does not come at the cost of speed. Let $W_{T_1,T_2} = \sum_{u \in T_1} \sum_{v \in T_2} \delta(u, v)$, where
$\delta(u, v) = 1$ if nodes $u$ and $v$ are labeled with the same symbol, and 0
otherwise. We will omit the subscripts of $W_{T_1,T_2}$ when the context is clear. When
$N \neq 1$, we show that MAST($T_1, T_2$) takes time $O(\sqrt{d}\,W \log n \log(nN))$. When
IV = 1 , we reduce the running time to 0(VdW log^ ^). Thus, if Ti and T 2 are 
uniformly labeled trees, then W < and the time complexity of our algorithm 
is 0{-\fdn^ log^ ^), which is faster than the algorithm in Q for any d. If T\ and 
T 2 are evolutionary trees, then W < n and the time complexity of our algo- 
rithm is 0{^fdn\o^ tiy This time complexity is better than the past results for 

d > n/ 2 ‘^*^V*°s"). jjj particular, '/dnlog^ ^ = 0 (n^ ®) for any degree d. 
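
The quantity $W_{T_1,T_2}$ counts the node pairs that carry the same symbol, so it can be obtained from per-symbol counts without enumerating all pairs. The sketch below is an illustration only; the representation of a tree as a flat list of node labels, the use of `None` for unlabeled nodes, and the choice to let unlabeled nodes contribute nothing are assumptions made for this example.

```python
from collections import Counter

def pair_weight(labels1, labels2):
    """W = sum over u in T1 and v in T2 of delta(u, v), where delta(u, v) = 1
    when u and v carry the same symbol. Unlabeled nodes are given as None and
    are assumed here not to contribute."""
    count1 = Counter(sym for sym in labels1 if sym is not None)
    count2 = Counter(sym for sym in labels2 if sym is not None)
    # Each symbol contributes (occurrences in T1) * (occurrences in T2).
    return sum(c * count2[sym] for sym, c in count1.items())
```

For evolutionary trees every symbol occurs at most once per tree, so each symbol contributes at most one pair; when all nodes share one symbol, every pair of nodes contributes.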

Section 2 discusses basics. Section 3 details our algorithm, which computes
MAST(Ti,r 2 ) in 0{y/dW\ogn\og{nN)) time for N ^ 1 and in 0{VdW log^ 
time for $N = 1$. Section 4 concludes the paper with open problems.



2 Preliminaries 

2.1 Basic Concepts 

Restricted Subtrees: For a rooted tree $T$ and any node $u$ of $T$, let $T^u$ denote
the subtree of $T$ that is rooted at $u$. For any set $L$ of symbols, the restricted
subtree of $T$ with respect to $L$, denoted by $T\|L$, is the subtree of $T$ whose nodes
are the nodes labeled with symbols in $L$ and the least common ancestors of any two nodes
labeled with symbols in $L$, and whose edges preserve the ancestor-descendant relationship
of $T$. Note that $T\|L$ may contain nodes with labels outside $L$; see Figure 1 for
an example. For any labeled tree $T'$, let $T\|T'$ denote the restricted subtree of $T$
with respect to the set of symbols used in $T'$.
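
The definition of $T\|L$ can be turned into a single bottom-up pass. The following is a minimal sketch under assumed conventions (a `children` dictionary mapping a node to its list of children, a `label` dictionary for labeled nodes only, and an illustrative function name); it is not the paper's implementation.

```python
def restricted_subtree(children, label, root, symbols):
    """Return (root, children) of T||L.

    A node is kept if its label is in `symbols`, or if at least two of its
    child subtrees contain kept nodes (it is then the least common ancestor
    of two nodes labeled with symbols in L)."""
    new_children = {}

    def visit(u):
        # Topmost kept nodes found strictly below u; each child contributes
        # at most one, because a child with two or more would itself be kept.
        kept_below = []
        for c in children.get(u, []):
            kept_below.extend(visit(c))
        if label.get(u) in symbols or len(kept_below) >= 2:
            new_children[u] = kept_below
            return [u]
        return kept_below

    tops = visit(root)
    return (tops[0] if tops else None), new_children
```

A kept least common ancestor may itself carry a label outside $L$, which matches the remark in the text. The recursion is adequate for small examples; deep trees would need an iterative traversal.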








Fig. 1. The restricted subtree T||L with L = {a, b, c}. Note that T||L contains a label 
not in L. 



Centroid Paths: A centroid path decomposition of a rooted tree $T$ is a
partition of its nodes into disjoint paths as follows. For each internal node $u$ in
$T$, let $C(u)$ denote the set of children of $u$. Among the children of $u$, one
is chosen to be the heavy child, denoted by hvy($u$), if the subtree of $T$ rooted at
hvy($u$) contains the largest number of nodes; the other children of $u$ are the side
children. The edge from $u$ to its heavy child is a heavy edge. A centroid path is
a maximal path formed by heavy edges; the root centroid path is the centroid
path which contains the root of $T$. Let $\mathcal{D}(T)$ denote the set of the centroid paths
of $T$. $\mathcal{D}(T)$ forms the desired partition and can be constructed in $O(|T|)$ time.
For each path $P$ in $\mathcal{D}(T)$, let $r(P)$ denote the node in $P$ which is the closest to
the root of $T$. Let $\prec$ denote the ordering on $\mathcal{D}(T)$ where $P_1 \prec P_2$ if and only if
$r(P_1)$ is a descendant of $r(P_2)$.
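
A centroid path decomposition can be computed with two linear passes: one for subtree sizes and one that follows heavy edges. This is a minimal sketch with assumed names and an assumed `children` dictionary representation; ties between equally large children are broken by taking the first one.

```python
def centroid_paths(children, root):
    """Return (paths, heavy): the centroid paths of the tree, each listed from
    its top node r(P) downwards, and the heavy-child map hvy(u)."""
    # Preorder via an explicit stack: parents appear before their children.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    # Subtree sizes, processing children before parents.
    size = {}
    for u in reversed(order):
        size[u] = 1 + sum(size[c] for c in children.get(u, []))

    # The heavy child is a child with the largest subtree.
    heavy = {}
    for u in order:
        kids = children.get(u, [])
        if kids:
            heavy[u] = max(kids, key=lambda c: size[c])

    # Every node that is not a heavy child starts a new centroid path.
    heavy_children = set(heavy.values())
    paths = []
    for u in order:
        if u not in heavy_children:
            path = [u]
            while path[-1] in heavy:
                path.append(heavy[path[-1]])
            paths.append(path)
    return paths, heavy
```

The first returned path starts at the root, so it is the root centroid path; the subtrees hanging off a path at side children are its sidetrees.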

For each $P \in \mathcal{D}(T)$ and each node $u$ in $P$, the subtree of $T$ rooted at a side
child of $u$ is a sidetree of $u$ as well as of $P$. Let side($P$) denote the set of all the
sidetrees of $P$. Note that the size of a sidetree of $P$ is at most $|T|/2$.

Fact 1. Let $P$ be the root centroid path of $T_1$.

1. $W_{T_1,T_2} = W_{T_1,\,T_2\|T_1}$; thus, $W_{\Upsilon,\,T_2\|\Upsilon} = W_{\Upsilon,T_2}$ for any sidetree $\Upsilon$ of $P$.

2. $W_{T_1,T_2} = W_{P,T_2} + \sum_{\Upsilon \in \mathrm{side}(P)} W_{\Upsilon,T_2}$.

Intersecting Pairs: Consider $P \in \mathcal{D}(T_1)$ and $Q \in \mathcal{D}(T_2)$. For any nodes $u \in P$
and $v \in Q$ and any sidetrees $\Upsilon$ of $u$ and $\Gamma$ of $v$, the sidetree pair $(\Upsilon, \Gamma)$ of $(P, Q)$
is intersecting if $\Upsilon$ and $\Gamma$ have some common symbol. The sidetree-node pair
$(\Upsilon, v)$ (and $(u, \Gamma)$, respectively) of $(P, Q)$ is intersecting if there exists a node in
$\Upsilon$ (and $\Gamma$, respectively) which has the same label as $v$ (and $u$, respectively). The
node pair $(u, v)$ of $(P, Q)$ is intersecting if (1) $u$ and $v$ have the same label; or
(2) there exist sidetrees $\Upsilon$ of $u$ and $\Gamma$ of $v$ such that $(\Upsilon, \Gamma)$, $(\Upsilon, v)$, or $(u, \Gamma)$ is
intersecting.

Let $\ell_{PQ}$ be the total number of node pairs $(u, v)$ of $(P, Q)$ such that $u, v$
have the same symbol plus the total number of intersecting sidetree pairs and
sidetree-node pairs of $(P, Q)$. Let $B(u, v)$ be the union of

- $\{(\Upsilon, \Gamma), (\Upsilon, T_2^v), (T_1^u, \Gamma) \mid (\Upsilon, \Gamma)$ is intersecting and $\Upsilon$ and $\Gamma$ are sidetrees of
$u$ and $v$, respectively$\}$;

- $\{(\Upsilon, T_2^v) \mid (\Upsilon, v)$ is intersecting and $\Upsilon$ is a sidetree of $u\}$;

- $\{(T_1^u, \Gamma) \mid (u, \Gamma)$ is intersecting and $\Gamma$ is a sidetree of $v\}$.

Let $B(P, Q) = \bigcup_{u \in P, v \in Q} B(u, v)$. Note that $|B(P, Q)| \le 3\ell_{PQ}$.

Maximum Weighted Matchings: Let mwm($G$) denote the maximum possible
weight of any matching of a weighted bipartite graph $G = (X, Y, E)$.

Fact 2. Let $s = \max\{|X|, |Y|\}$. Let $w$ be the total weight of the edges
in $G$. Then mwm($G$) can be computed in $O(\sqrt{s}\,w)$ time.

Our subtree algorithm needs to find maximum weight matchings for a number of
bipartite graphs. For many of these graphs, the sizes of the two vertex sets may
differ greatly. We improve the above result to take advantage of this difference.

Lemma 1. Let $t = \min\{|X|, |Y|\}$. Let $w$ be the total weight of the edges in $G$.
Then mwm($G$) can be computed in $O(\sqrt{t}\,w)$ time.

Proof. To be given in the full paper.

This paper uses maximum weight matchings of the bipartite graphs $G_{uv}$ and
$H_{uv}$, defined as follows for any nodes $u \in T_1$ and $v \in T_2$:

- $G_{uv}$ is the bipartite graph between $C(u)$ and $C(v)$ where each edge $(x, y)$
has weight MAST($T_1^x, T_2^y$).

- $H_{uv}$ is the bipartite graph between $C(u) - \{\mathrm{hvy}(u)\}$ and $C(v) - \{\mathrm{hvy}(v)\}$
where $(x, y)$ is an edge of $H_{uv}$ if and only if MAST($T_1^x, T_2^y$) $> 0$ (i.e., $(T_1^x, T_2^y)$
is an intersecting sidetree pair, with $T_1^x$ and $T_2^y$ being sidetrees of $u$ and $v$,
respectively), and where the weight of each edge $(x, y)$ is MAST($T_1^x, T_2^y$).



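Lemma 2. Consider any centroid paths $P$ of $T_1$ and $Q$ of $T_2$. Let $n_1 = |T_1^{r(P)}|$
and $n_2 = |T_2^{r(Q)}|$. Let $M_{PQ}$ be the time required to compute mwm($H_{uv}$) for all
intersecting node pairs $(u, v)$ of $(P, Q)$. If $N \neq 1$, then

\[ M_{PQ} = O\big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\}\, \ell_{PQ} \log(nN)\big); \]

if $N = 1$, then $M_{PQ} = O\big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\}\, W_{T_1^{r(P)},\,T_2^{r(Q)}}\big)$.

Proof. For $N \neq 1$, by the Gabow-Tarjan matching algorithm, mwm($H_{uv}$) can be
computed in $O\big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\}\, \ell_{uv} \log(nN)\big)$ time, where $\ell_{uv}$ is the
number of intersecting sidetree pairs $(\Upsilon, \Gamma)$ with $\Upsilon$ and $\Gamma$ being sidetrees of $u$ and
$v$, respectively. Thus,

\[ M_{PQ} = O\Big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\} \sum_{u \in P, v \in Q} \ell_{uv} \log(nN)\Big)
          = O\big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\}\, \ell_{PQ} \log(nN)\big). \]

For $N = 1$, mwm($H_{uv}$) can be computed in $O\big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\}\, W_{uv}\big)$ time,
where $W_{uv} = \sum\{W_{\Upsilon,\Gamma} \mid \Upsilon$ is a sidetree of $u$ and $\Gamma$ is a sidetree of $v\}$ (by
Lemma 1). Thus,

\[ M_{PQ} = O\Big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\} \sum_{u \in P, v \in Q} W_{uv}\Big)
          = O\big(\min\{\sqrt{d}, \sqrt{n_1}, \sqrt{n_2}\}\, W_{T_1^{r(P)},\,T_2^{r(Q)}}\big). \]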



2.2 Technical Lemmas 



This section states two technical results, which are crucial to our algorithm
for computing MAST($T_1, T_2$). The algorithm is based on the following formula, which is a
straightforward generalization of the formula previously used for finding a maximum
agreement subtree of two evolutionary trees.

\[
\mathrm{MAST}(T_1^u, T_2^v) = \max
\begin{cases}
\max\{\mathrm{MAST}(T_1^u, T_2^{c_2}) \mid c_2 \in C(v)\}; \\
\max\{\mathrm{MAST}(T_1^{c_1}, T_2^v) \mid c_1 \in C(u)\}; \\
\mathrm{mwm}(G_{uv}) & \text{if $u$ and $v$ are unlabeled;} \\
\mathrm{mwm}(G_{uv}) + \mu(z) & \text{if both $u$ and $v$ are labeled $z$.}
\end{cases}
\tag{1}
\]
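
The recurrence can be evaluated directly on small trees. The sketch below is a naive illustration, not the algorithm of this paper: it memoizes all node pairs and uses a brute-force maximum weight matching, so it is only suitable for tiny inputs. The tree representation (`children` and `label` dictionaries, unlabeled nodes absent from `label`), the default weight of 1, and all function names are assumptions.

```python
from functools import lru_cache

def mast(children1, label1, root1, children2, label2, root2, weight=None):
    """Evaluate the MAST recurrence for the subtrees rooted at root1 and root2."""
    weight = weight or {}

    def mwm(n_rows, n_cols, w):
        """Brute-force maximum weight bipartite matching over a weight matrix."""
        @lru_cache(maxsize=None)
        def best(i, used):
            if i == n_rows:
                return 0
            result = best(i + 1, used)              # leave row i unmatched
            for j in range(n_cols):
                if not used & (1 << j):
                    result = max(result, w[i][j] + best(i + 1, used | (1 << j)))
            return result
        return best(0, 0)

    @lru_cache(maxsize=None)
    def solve(u, v):
        c1 = tuple(children1.get(u, ()))
        c2 = tuple(children2.get(v, ()))
        value = 0
        for y in c2:                                # first case of the formula
            value = max(value, solve(u, y))
        for x in c1:                                # second case of the formula
            value = max(value, solve(x, v))
        lu, lv = label1.get(u), label2.get(v)
        if lu == lv:                                # both unlabeled, or same symbol
            w = [[solve(x, y) for y in c2] for x in c1]
            bonus = 0 if lu is None else weight.get(lu, 1)
            value = max(value, mwm(len(c1), len(c2), w) + bonus)
        return value

    return solve(root1, root2)
```

For example, comparing the leaf-labeled trees (a, (b, c)) and ((a, b), c) with unit weights yields 2, since any two of the three leaves agree but all three do not.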



Based on Equation (1), MAST($T_1, T_2$) can be computed by bottom-up dynamic
programming. However, to find MAST($T_1^u, T_2^v$) for all nodes $u$ of $T_1$ and $v$ of $T_2$,
the dynamic program needs to compute $O(n^2)$ maximum weight matchings. To
reduce the time spent computing the maximum weight matchings, we make use of
the following lemma, which implies that it suffices to focus on intersecting node
pairs of centroid paths.

Lemma 3. Let $P$ and $Q$ be the root centroid paths of $T_1$ and $T_2$, respectively.
Suppose that we are given the values MAST($\Upsilon, \Gamma$) for all the pairs $(\Upsilon, \Gamma) \in B(P, Q)$
and the values mwm($H_{uv}$) for all intersecting node pairs $(u, v)$ of $(P, Q)$.
Then, for any subset $S$ of node pairs of $(P, Q)$, we can compute MAST($T_1^x, T_2^y$)
for all $(x, y) \in S$ in $O((\ell_{PQ} + |S|) \log n)$ time.

Proof. See Appendix A.

Remark. Cole and Hariharan and Przytycka obtained this lemma for
the case where $T_1$ and $T_2$ are evolutionary trees.

The next theorem combines Lemmas 2 and 3 and is important to our
algorithm for computing MAST($T_1, T_2$).

Theorem 1. Consider the root centroid path $P$ of $T_1$ and any centroid path $Q$
of $T_2$. Suppose that the values MAST($\Upsilon, \Gamma$) for all the pairs $(\Upsilon, \Gamma) \in B(P, Q)$ are
given. The following values can be computed in $O(\ell_{PQ} \log n + M_{PQ})$ time, where
$M_{PQ}$ is defined in Lemma 2.

1. MAST($T_1, T_2^{r(Q)}$).

2. MAST($T_1, T_2^v$) and MAST($T_1^u, T_2^{r(Q)}$) for all intersecting node pairs $(u, v)$ of $(P, Q)$.

Proof. Let $S = \{(r(P), r(Q))\} \cup \{(r(P), v), (u, r(Q)) \mid (u, v)$ is an intersecting
node pair$\}$. Note that $|S| = O(\ell_{PQ})$ and the set of values required by this theorem
is $\{$MAST($T_1^x, T_2^y$) $\mid (x, y) \in S\}$. Based on Lemmas 2 and 3, these values can be
computed using $O(\ell_{PQ} \log n + M_{PQ})$ time, as stated.

3 An Algorithm for Computing MAST($T_1, T_2$)

In this section, we present a recursive algorithm for computing MAST($T_1, T_2$).
This algorithm recursively computes MAST($\Upsilon, T_2$) for every sidetree $\Upsilon$ attached
to the root centroid path of $T_1$. The information gathered is then used to
compute MAST($T_1, T_2$) using a dynamic programming approach based on Theorem 1.
Figure 2 presents our algorithm. For the sake of recursion, the algorithm
computes values other than MAST($T_1, T_2$), namely, MAST($T_1, T_2^v$) for all nodes $v$ of
$T_2$. Steps 3 and 4 of the algorithm are further explained below.

Step 3 is based on the following observation of Farach and Thorup. Let $P$
be the root centroid path of $T_1$. Let $\Upsilon$ be a sidetree of $P$. Let $U$ be the set of nodes
in $T_2\|\Upsilon$. Then for all $v \in T_2$, MAST($\Upsilon, T_2^v$) $= 0$ if $v$ has no descendant in $U$;
otherwise, MAST($\Upsilon, T_2^v$) $=$ MAST($\Upsilon, (T_2\|\Upsilon)^u$) where $u$ is the highest descendant
of $v$ in $U$. Thus, for every $\Upsilon \in$ side($P$) and every node $v$ of $T_2$, MAST($\Upsilon, T_2^v$) can
be retrieved by finding the highest descendant of $v$ in $U$. Using Euler tours,
Step 3 can preprocess in linear time the $O(n)$ values obtained in Step 2 so that
MAST($\Upsilon, T_2^v$) can be retrieved in $O(\log n)$ time.

Input: $T_1$, $T_2$;

Output: MAST($T_1, T_2^v$) for all nodes $v$ of $T_2$;

1. Let $P$ be the root centroid path of $T_1$;

2. For each sidetree $\Upsilon$ of $P$, call recursively Agree($\Upsilon, T_2\|\Upsilon$);

3. Using the values found in Step 2, do an $O(n)$-time preprocessing so that for
every $\Upsilon \in$ side($P$) and every node $v$ of $T_2$, MAST($\Upsilon, T_2^v$) can be retrieved in
$O(\log n)$ time;

4. For each centroid path $Q \in \mathcal{D}(T_2)$ in increasing order according to $\prec$,

(a) Extract the values of MAST($\Upsilon, \Gamma$) for all $(\Upsilon, \Gamma) \in B(P, Q)$;

(b) Find MAST($T_1, T_2^v$) and MAST($T_1^u, T_2^{r(Q)}$) for all intersecting node pairs
$(u, v)$ of $(P, Q)$ based on Theorem 1;

(c) Find MAST($T_1, T_2^v$) for all nodes $v$ of $Q$;



Fig. 2. Algorithm Agree($T_1, T_2$)



Let us consider Step 4. We handle $Q \in \mathcal{D}(T_2)$ in increasing order according
to $\prec$. When the algorithm handles $Q$, it first retrieves MAST($\Upsilon, \Gamma$) for all $(\Upsilon, \Gamma) \in
B(P, Q)$ in Step 4(a). Recall that for every pair $(\Upsilon, \Gamma)$ of $B(P, Q)$, either
$\Upsilon$ is a sidetree of $P$ or $\Gamma$ is a sidetree of $Q$. For all $(\Upsilon, \Gamma) \in B(P, Q)$ such that
$\Upsilon$ is a sidetree of $P$, the values MAST($\Upsilon, \Gamma$) can be extracted in $O(\ell_{PQ} \log n)$
time after the preprocessing in Step 3. The rest of the pairs in $B(P, Q)$ must be
of the form $(T_1^u, \Gamma)$ where $u$ is a node of $P$ and $\Gamma$ is a sidetree of $Q$. For each
such $(T_1^u, \Gamma)$, it can be verified that MAST($T_1^u, \Gamma$) is already available when the
path pair $(P, Q')$ is handled, where $Q'$ is the root centroid path of $\Gamma$. After the
retrieval of MAST($\Upsilon, \Gamma$) for all $(\Upsilon, \Gamma) \in B(P, Q)$, the values wanted in Step 4(b)
can be found based on Theorem 1.

To explain Step 4(c), let $v_1, v_2, \ldots, v_k$ be a subsequence of nodes in $Q$ such
that $v_1 = r(Q)$ and for each $v_i$ with $i > 1$, there exists some $u$ in $P$ such that
$(u, v_i)$ is intersecting. The values of MAST($T_1, T_2^{v_i}$) for all $v_i$ are available from
Step 4(b). The value of MAST($T_1, T_2^v$) for $v$ located in a subpath between $v_i$ and
$v_{i+1}$ is MAST($T_1, T_2^{v_{i+1}}$). As a result, Step 4 computes the values of MAST($T_1, T_2^v$)
for all nodes $v$ of $T_2$.

Section 3.1 shows that Agree($T_1, T_2$) takes $O(\sqrt{d}\,W_{T_1,T_2} \log n \log(nN))$ time
for $N \neq 1$. Section 3.2 considers the case where $N = 1$, for which the time
complexity is only 0 {'/dWT^^T 2 log^ 5 )- 

3.1 Time Complexity of Agree($T_1, T_2$) for $N \neq 1$

Lemma 4. Let $P$ be the root centroid path of $T_1$. Then,

^PQ = O (wp,T2 log IP2I + X! { ^PT2 log I i^L I P G side(P) 
Qex>(T2) ^ ^ I 2|| I 

Proof. Note that J^i^PQ I Q ^ T>(P 2 )} = Ai + A 2 + A 3 + A 4 where Ai = the 
total number of node pairs of (P, Q) which are labeled with the same symbol over 
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all Q G T>{T2)] A2 = the total number of intersecting sidetree-node pairs {u, F) 
of (P,Q) over all Q G T>{T2)', = the total number of intersecting sidetree- 

node pairs (T,v) of {P,Q) over all Q G T>{T2)', and A4 = the total number of 
intersecting sidetree pairs {T, P) of {P, Q) over all Q G V{T2). 

For a fixed Q, the number of node pairs of (P,Q) which are labeled with the 
same symbol is at most J 2 ueP veQ ~ ^P-.Q- Therefore, Ai = J^i^PQ I 

Q G P(r 2 )} = VFp.t,. 

For a fixed sidetree T of P and a fixed Q, the number of intersecting sidetree- 
node pairs (T,v) of (P,Q) is at most Wr,Q- Therefore, A2 = I ^ ^ 

side(P) and Q G T’(l2)} = J21^t,T2 I T G side(P)}. 

For a fixed sidetree P of a particular Q, the number of intersecting sidetree- 
node pairs (u,P) of (P,Q) is at most Wpr- Therefore, A3 = \ P G 

siDE(g) and Q G V{T2)} = Wp,t2 log 1 ^ 2 |- ’ 

For a fixed sidetree T of P, let i? be the tree formed from P 2 | 1 T by adding an 
edge between the root of T2 and that of P2IIT. Let S = U{side((3) | Q G P(P2)}- 
For each edge (x, y) in R with x being the parent of y, (x, y) corresponds to 
a simple path Q^y in T2 from x to y. By the definition of intersecting, for all 
sidetrees G 5 , (T, P) is an intersecting sidetree pair if and only if the root of 
P is on some path Q^y where (a;, y) is an edge in R. 

For each edge (x, y) in P, the number of sidetrees whose roots are on Q^y 
I I 

is less than log|^. Thus, the number of sidetrees in S which is intersecting 
with T is less than SUM(P) = |^- claim that SUM(P) = 

O (|P2||T| log ^). Then, 

^4 = O I IP2IITI log I r G SIDE(P)j) 

= O (E log I T G side(P)|^ . 

The rest of this proof shows that SUM(P) = O ^|P2||T| log ^ ■ Let Z 

be the set of all the edges {x,y) G R where y is a leaf. Since \Z\ < |P| 
and {T2 I (x,y) G is a set of disjoint subtrees of T2, J 2 {x y)ez^^Sj^ — 

log |Pf I < |P|logJj^. Let R' be the tree obtained by first removing 
all the edges in Z and then replacing every maximal internally branchless un- 
labeled path with an edge between its two endpoints. (An internally branchless 
unlabeled path is a path from a node to a proper descendant such that all its 
nodes, except possibly the endpoints, are unlabeled and have at most one child 
each.) Note that |P'| < \R \/2 and SUM(P) = |P|log^j^ -I- SUM(P'). Since 

|P| = IP2IITI, SUM(P) = o(|P|log 9 l) = O (|P2||r|log^). 



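Theorem 2. Agree($T_1, T_2$) takes $O(\sqrt{d}\,W_{T_1,T_2} \log n \log(nN))$ time.

Proof. Let $\Phi(T_1, T_2)$ be the time complexity of Agree($T_1, T_2$). Steps 1 and 3 of
Agree($T_1, T_2$) take $O(n)$ time. Step 2 takes time $\sum\{\Phi(\Upsilon, T_2\|\Upsilon) \mid \Upsilon \in \mathrm{side}(P)\}$.
Step 4(a) takes time $O(\sum\{\ell_{PQ} \log n \mid Q \in \mathcal{D}(T_2)\})$. By Theorem 1, Step 4(b) takes
time $O(\sum\{\sqrt{d}\,\ell_{PQ} \log(nN) \mid Q \in \mathcal{D}(T_2)\})$. Step 4(c) requires $O(\sum\{|Q| \mid Q \in
\mathcal{D}(T_2)\}) = O(n)$ time. In summary,

\[ \Phi(T_1, T_2) = O\Big(\sum_{Q \in \mathcal{D}(T_2)} \sqrt{d}\,\ell_{PQ} \log(nN)\Big) + \sum_{\Upsilon \in \mathrm{side}(P)} \Phi(\Upsilon, T_2\|\Upsilon). \]

By Lemma 4 and Fact 1,

\[ \Phi(T_1, T_2) = O\big(\sqrt{d}\,W_{T_1,T_2} \log |T_2| \log(nN)\big) = O\big(\sqrt{d}\,W_{T_1,T_2} \log n \log(nN)\big). \]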



3.2 Time Complexity of Agree($T_1, T_2$) for $N = 1$

This section assumes N = 1 . Let T'{Ti,T2) be the time required for Steps H 
and the non-matching computation of Step^^for all recursion levels 
of Agree (Ti,r2)- Let T/'(ri,I^and T"{Ti,T2) be the times required for the 
matching computation of Step^Jfor the first recursion level and for all recursion 
levels of Agree ( Ti, T2), respectively. 

Lemma 5. $T'(T_1, T_2) = O(W_{T_1,T_2} \log^2 n)$.

Proof. In the first recursion level, Steps^^^^l and^Jtake time 0 ( V{fpg log n 
I Q G I?(r2)}) by an argume nt si milar to the proof of Theorem H The non- 
matching comp^ation of Step requires 0 (^{^pq log n | Q G time. 

StepsHB^ 3^3 and the non- matching computation part of Step^Jin all the 
remaining recursion levels take time ^{T'(T, T2||L’) | T G side(P)}. Therefore, 
T'{Ti,T2) = EQG'D(T2)^PQlog’^ + ErGSiDE(p)'^'(^>^2||L’). By LemmaOand 
FactB T'(Ti, T2) = 0 (IUti.t 2 log logn) = 0 {Wt,,t 2 log" n). 



Lemma 6. If N = 1 , then 

r"(Ti, T2) = O (^min I VdIFTi.T2 log V\tT\Wt„T2 ^}) ' 



Proof. Let the root centroid path in T2 be the level -0 centroid path. For every 
centroid path X, if the parent of r{X) is a node in a level-z centroid path, let 
A be a level-{i 1 ) centroid path. This classification divides T>{T2) into level-i 



centroid paths for i = 1 , . . . [log IT2I] . Let Ai = J 2 



min 






Wrp^ j,r(Q) I Q is a level-i centroid path of T2 J . Note that for each level-i centroid 

path Q) IT2 I < ■^. Also, if Qi and Q2 are two different level-z centroid paths, 
then and are disjoint. Therefore, Ai is not greater than 



min vW, Y 

< min I Vd, 



nr(Q) 



Q is a level-? centroid path of T2} 
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By Theorem | the matching computation of Step 

Vd, VWT\, I Q G V{T2) 



Step^3 



takes time 
Therefore, 



Step^Jtakes time 

/ [log IT2II 

o ^ Zl, 



/[‘Qg minj? | Ti | } 1 



= o 



2=1 



[log l^2|1 



^ min I Vd, ^/\^\\ }• Wti.Ts + X! 

Wti ,T2 



Wti ,T2 







= O (min I Vd, -\/|Ti|| log 



min{(i, |Ti|} 

O ^min I VdWTi,T 2 log v 1 ^WTi,r 2 log , as stated. 



Lemma 7. If N = 1, then T”{Ti,T 2 ) = O (^'/dWT^^T 2 log^ • 

Proof. By LemmaH th® matching computation of Step^Jin the first recursion 
level takes time T{'{Ti,T 2 ) = o(uiin log \ZWi\Wt^,T 2 log |^}) • 

For the remaining recursion levels, the matching computation of Step^Jtakes 
time J2{'^”{'^,T2\\T) I r G side(P)}. Therefore, T”{Ti,T 2 ) = 



O 



J^VdWTi,T 2 v^|Ti|Wti,t 2 log 




^ T "{ T , T2 \\ r ) 

TgSIDE(P) 



■ . (2) 
As in Lemma^ the centroid paths of T\ can be classified into level-i centroid 
paths for z = 1, . . . , [log |Ti|] . Define Ai to be 



E 



T 

'/dWr,T 2 log — 7 -^ , \/|T’| Wt,t 2 log 



level-z centroid path of Ti | . 



1 ^ 

|T| 



T is a sidetree of some 



Note that |T| < 1^ for every sidetree T of some level-z centroid path of Ti. 
Also, for two different sidetrees Ti and T 2 of some level-z centroid paths of Ti, 
Ti and T 2 are disjoint. Hence, 



A, < min <1 y/dWT,,T 2 log ^ log ^ 
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Since Equation Q is a tail recursion, T"{Ti,T 2 ) = 



o E 



( 



RoglTill 



= 0 y VdlUTi T, log ^ + 

\ 





i— 



o(^v^lUn.T.log^log^^ 



= o 



O (^VdWT„T, log' . 



Theorem 3. If N = 1, then Agree(Ti, Tf) takes 0{VdWT^^T2 log' time. 

Proof. By Lemmas B D Agree(Ti,r 2 ) takes time T'{Ti,T 2 ) + T" {Ti,T 2 ) 
=0(VdlUT,.r.log' 5). 

4 Open Problems 

This paper shows that MAST(Ti,r 2 ) can be computed in 0(-\/dWri,r2 log n 
log(nA^)) time for N ^ 1 and in O ( 's/d W ti.T 2 log' §) time for = 1. Note 
that if T\ and T 2 are evolutionary trees, our results do not match the result 
in ^9 for small d. Thus, we conjecture that mast(Ti,T 2 ) can be computed in 
0 (^Wti,T 2 log(nA^)) time for 1 and in 0 (-s/dWri,T 2 log time for A^ = 1. 
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A Lemma Q 



Consider $P \in \mathcal{D}(T_1)$ and $Q \in \mathcal{D}(T_2)$. Let $u_1, u_2, \ldots, u_p$ ($v_1, v_2, \ldots, v_q$,
respectively) be the nodes in $P$ ($Q$, respectively) in order from root to leaf. Let $G(P, Q)$
be the bipartite multisubgraph between $u_1, \ldots, u_p$ and $v_1, \ldots, v_q$ where $(u_i, v_j)$
is an edge if and only if $(u_i, v_j)$ is an intersecting node pair of $(P, Q)$. The
multiplicity and weights of $(u_i, v_j)$ are specified as follows.

If Ui and Vj have the same label or are both unlabeled, (ui,Vj) has two 
weighted copies: 

— white edge: The weight is mwm( 7L„^„^) if u and v are both unlabeled; the 
weight is MWM(iL„^„^ ) -|- if u and v have the same label z. 

— gray edge: Let be the bipartite subgraph of {C{ui),C{vj),C{ui) x 

C{vj) — {(hvy(ui), hvy(uj))}) where {x,y) is an edge of if and only if 

MAST(rf , T 2 ) > 0 and where the weight of (a;, y) is MAST(Tf , T|'). If u and 

V are both unlabeled, the weight of the gray edge is MWM(iL^.„^); if u and 

V have the same label z, the weight is the maximum of MWM(iL^ „ .) -|- /i(z), 
max{MAST(r“, P) I P is a sidetree of u}, and max{MAST(T, ) | T is a 
sidetree of u}. 

If Ui and Vj are unlabeled, then {ui, Vj) has two additional weighted copies: 

— green edge: The weight is max{MAST(P“% P) | P is a sidetree of Vj}. 

— red edge: The weight is max{MAST(T, | L’ is a sidetree of Ui}. 

If Ui and Vj have different symbols, then (ui, Vj) has only one weighted copy: 

— gray edge: The weight is the greater of max{MAST(P“, P) | P is a sidetree 
of v} and max{MAST(P, P2 ) | P is a sidetree of u]. 

Lemmajshows that G(P, Q) can be constructed in 0 {^PQ logd) time based 
on the following two facts. 

Fact 3 (See ^3)- Consider a bipartite graph G and two nodes x and y of G. 
Let M be a maximum weight matching of G — {x,y}. Then, mwm(G) can be 
computed by augmenting M using at most two augmenting paths. 



Fact 4 (See El ). For all intersecting node pairs (u,v) of (P,Q), we can 
construct graphs from in O^ipglogd) time such that mwm(P"^) = 
mwm(P(,„), \H”J = \Huv\, and H^v is a subgraph of H!f„. 



Lemma 8. Given MAST($\Upsilon, \Gamma$) for all $(\Upsilon, \Gamma) \in B(P, Q)$ and a maximum weight
matching of $H_{uv}$ for every intersecting node pair $(u, v)$ of $(P, Q)$, $G(P, Q)$ can
be constructed in $O(\ell_{PQ} \log d)$ time.

Proof. This proof shows that for all intersecting node pairs (u, v) of (P, Q), the 
values mwm(P„^), mwm(P(,„), max{MAST(P”, P) | P is a sidetree of u}, and 
max{MAST(P, Tf) I P is a sidetree of u} can be computed in 0{£pq logd) time. 
Then, G{P,Q) can be constructed using 0{£pq) additional time. 
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For a fixed v of Q, let Ui^, . . Ui^ be the nodes of P with ii > i 2 > ■ ■ ■ > iz 
such that (ui^,v) is intersecting. We can compute f{k) = max{MAST(T^ ,P) \ 
F is a sidetree of u} for all k based on the recurrence /(fc) = max{/(fc — 
1), max{MAST(r“*'“ , F) I F is a sidetree of v and (F“’'“ , F) G B{ui^^,v)}}. This 
takes \B{ui^,v)\) time. Therefore, we can compute max{MAST(F”, F) | F 

is a sidetree of u} for all intersecting node pairs (u, v) in time 

vGQ uGP 

Similarly, we can compute max{MAST(F“, F) | F is a sidetree of u} for all inter- 
secting node pairs {u,v) with the same time complexity. 

The values mwm( 77„„) are already available. The rest of this proof con- 
siders the computation of mwm{H'^^). By Fact^ we can construct from 
for all intersecting node pairs (u,v) using 0{£pq log d) time such that 
MWM(i?"J = MWM(i?;j, = 0(|Ff;j) and is a subgraph of 

Then, as — {hvy(u), hvy(u)} = by FactH each mwm{H”^) can be 

computed by augmenting a maximum weight matching of Huv using at most two 
augmenting paths. Therefore, the values of all MWM(i7"„) and hence MWM(i?(j^) 
can be computed using time Eugp.^gQ \Huv\) = 0{£pq). 

In summary, the values of MWM(Ff^„) for all intersecting node pairs (u, v) can 
be computed in O {£pQ log d) time. 

For edges $(u_i, v_j)$ and $(u_{i'}, v_{j'})$ of $G(P, Q)$, $(u_i, v_j)$ is below $(u_{i'}, v_{j'})$ if $i > i'$
and $j > j'$. They form a crossing if ($i < i'$ and $j > j'$) or ($i > i'$ and $j < j'$). A
matching in $G(P, Q)$ is an agreement matching if the following statements
hold:

1. Its white edges, if any, do not form any crossing.

2. In addition to those non-crossing white edges, it may contain (1) at most
one red-green crossing (i.e., the crossing between a red edge $(u_i, v_j)$ and a
green edge $(u_{i'}, v_{j'})$ with $i < i'$) or (2) at most one gray edge which is below
all the white edges.

The weight of an agreement matching is the total weight of its edges. A
maximum agreement matching is an agreement matching in $G(P, Q)$ which
maximizes the weight; this maximum weight is denoted as mam($G(P, Q)$). The
following lemma shows the relationship between maximum agreement matchings
and maximum agreement subtrees.

Lemma 9. For any $u \in P$ and $v \in Q$, $\mathrm{mast}(P^u, Q^v) = \mathrm{mam}(G(P^u, Q^v))$.

Proof. The proof follows from the definitions of maximum agreement matchings 
and maximum agreement subtrees. 



Fact 5 (See ^^ 9 ). Given G{P,Q) and a set S C P x Q, mam(G(F“, Q’')) 
for all {u,v) G S can be computed using 0{{£pQ |S'|)logn) time. 

LemmaHfollows from LemmasHH Fact^ 
Abstract. Perfect phylogeny is one of the fundamental models for studying
evolution. We investigate the following generalization of the problem:
The input is a species-characters matrix. The characters are binary and 
directed, i.e., a species can only gain characters. The difference from 
standard perfect phylogeny is that for some species the state of some 
characters is unknown. The question is whether one can complete the 
missing states in a way admitting a perfect phylogeny. The problem 
arises in classical phylogenetic studies, when some states are missing or 
undetermined. Quite recently, studies that infer phylogenies using inserted
repeat elements in DNA gave rise to the same problem. The best
known algorithm for the problem requires $O(n^2 m)$ time for $m$ characters
and $n$ species. We provide a near optimal $\tilde{O}(nm)$-time algorithm for the
problem.



1 Introduction 

When studying evolution, the divergence patterns leading from a single ancestor 
species to its contemporary descendants are usually modeled by a tree structure. 
Extant species correspond to the tree leaves, while their common progenitor 
corresponds to the root of this phylogenetic tree. Internal nodes correspond to 
hypothetical ancient species, which putatively split up and evolved into distinct 
species. Tree branches model changes through time of the hypothetical ancestor 
species. The common case is that one has information regarding the leaves, from 
which the phylogenetic tree is to be inferred. This task, called phylogenetic re- 
construction (cf. |7|), was one of the first algorithmic challenges posed by biology, 
and the computational community has been dealing with problems of this flavor 
for over three decades (see, e.g., P2])- 

In the character-based approach to tree reconstruction, contemporary species 
are described by their attributes or characters. Each character takes on one of 
several possible states. The input is represented by a matrix $A$ where $a_{ij}$ is the
state of character j in species i, and the i-th row is the character vector of species 
i. The output sought is a hypothesis regarding evolution, i.e., a phylogenetic tree 
along with the suggested character-vectors of the internal nodes. This output 
must satisfy properties specified by the problem variant. 

One important variant of the phylogenetic reconstruction problem is finding 
a perfect phylogeny. In this variant, the phylogenetic tree is required to have the 
property that for each state of a character, the set of all nodes that have that 



R. Giancarlo and D. Sankoff (Eds.): CPM 2000, LNCS 1848, pp. 143-^SSI 2002. 
Springer- Verlag Berlin Heidelberg 2002 






Itsik Pe’er, Ron Shamir, and Roded Sharan 



state induces a connected subtree. The general perfect phylogeny problem is NP- 
hard gEO]. When considering the number of possible states per character as a 
parameter, the problem is fixed parameter tractable For binary characters^ 

having only two states, perfect phylogeny is linear-time solvable P. 

Another common variant of phylogenetic reconstruction is the parsimony 
problem, which calls for a solution with fewest state changes altogether. A 
change is counted whenever the state of a character changes between a species 
and an offspring species. This problem is known to be NP-hard |BI. A special 
case introduced by Camin and Sokal jS| assumes that characters are binary and 
directed, namely, only $0 \to 1$ changes may occur. Denoting by 1 and 0 the presence
and absence, respectively, of the character, this means that characters can
only be gained during evolution. Another related binary variant is Dollo par- 
simony |!Traj . which assumes that the change 0 — > 1 may happen only once, 
i.e., a character can be gained once, but it can be lost several times. Both of 
these variants are polynomially solvable (cf. 0). When no perfect phylogeny is 
possible, one can seek a largest subset of the characters which admits a perfect 
phylogeny. Characters in such a subset are said to be compatible. Compatibility 
problems have been studied extensively (see, e.g., g7]). 

In this paper, we discuss a generalization of binary perfect phylogeny which 
combines the assumptions of both Camin-Sokal parsimony and Dollo parsimony. 
The setup is as follows: The characters are binary and directed. As in perfect 
phylogeny, the input is a matrix of character vectors, with the difference that 
some character states are missing. The question is whether one can complete 
the missing states in a way admitting a perfect phylogeny. We call this problem 
Incomplete Directed Perfect phylogeny (IDP). 

The problem of handling incomplete phylogenetic data arises whenever some 
of the data is missing. It is also encountered in the context of morphological 
characters, where for some species it may be impossible to reliably assign a state 
to a character. The popular PAUP software package provides an exponential 
solution to the problem by exhaustively searching the space of missing states. 
Indeed, the problem of determining whether a set of incomplete undirected char- 
acters is compatible was shown to be NP-complete, even in the case of binary 
characters 1^. 

Quite recently, a novel kind of genomic data has given rise to the same prob- 
lem: Nikaido et al. use inserted repetitive genomic elements, particularly 
SINEs, as a source of evolutionary information. SINEs are short DNA sequences 
that were copied and randomly reinserted into various genomic loci during evo- 
lution. The specific insertion events are identifiable by the flanking sequences 
on both sides of the insertion site. These insertions are assumed to be unique 
events in evolution, because the odds of having separate insertion events at the 
very same locus are negligible. Furthermore, a SINE insertion is assumed to be 
irreversible, i.e., once a SINE sequence has been inserted somewhere along the 
genome, it is practically impossible for the exact, complete SINE to leave that 
specific locus. However, the site and its flanking sequences may be lost when a 
large genomic region, which includes them, is deleted. In that case we do not 
know whether an insertion had occurred in the missing site. One can model such
data by assigning to each locus a character, whose state is ’1’ if the SINE oc- 
curred in that locus, ’0’ if the locus is present but does not contain the SINE, 
and ’?’ if the locus is missing. The resulting reconstruction problem is precisely 
Incomplete Directed Perfect phylogeny. 

The incomplete perfect phylogeny problem becomes polynomial when the 
characters are directed: Benham et al. studied the compatibility problem on 
generalized characters. Their work implies an $O(n^2 m)$-time algorithm for IDP,
where n and m denote the number of species and characters, respectively. A 
problem related to IDP is the consensus tree problem ETHl . This problem calls 
for constructing a consensus tree from homeomorphic binary subtrees, and is 
solvable in polynomial time. One can reduce IDP to the latter problem, but the 
reduction may take time. 

Our approach to the IDP problem is graph theoretic. We first provide sev- 
eral graph and matrix characterizations for solvable instances of binary directed 
perfect phylogeny. We then reformulate IDP as a graph sandwich problem: The 
input data is recast into two graphs, and solving IDP is shown to be equivalent
to finding a graph of a particular type “sandwiched” between them. This
formulation allows us to devise a polynomial algorithm for IDP. The deterministic
complexity of the algorithm is shown to be $O(nm + k \log^2(n + m))$, for an
instance with $k$ 1-entries in the matrix. Alternatively, we give a randomized version
of the algorithm which takes $O(nm + k \log(l^2/k) + l(\log l)^3 \log\log l)$ expected
time, where $l = n + m$. Since an $\Omega(nm)$ lower bound was shown by Gusfield
for directed binary perfect phylogeny, our algorithm has near optimal time
complexity.

The paper is organized as follows: In Section 2 we provide some preliminaries.
Section 3 gives the characterizations and the graph sandwich formulation. In
Section 4 we present the polynomial algorithm. For lack of space, some proofs
are shortened or omitted.

2 Problem Formulation 

We first specify some terminology and notation. We reserve the terms nodes and 
branches for trees, and will use the terms vertices and edges for other graphs. 
Matrices are denoted by an upper-case letter, while their elements are denoted 
by the corresponding lower-case letter. 

Let $G = (V, E)$ be a graph. We denote its set of vertices also by $V(G)$, and
its set of edges also by $E(G)$. For a vertex $v \in V$ we define its degree to a subset
$R \subseteq V$ to be the number of edges connecting $v$ to vertices in $R$. The length of a
path in $G$ is the number of edges along it.

Let $T$ be a rooted tree over a leaf set $S$ with branches directed from the root
towards the leaves. For a node $x$ in $T$ we denote the leaf set of the subtree rooted
at $x$ by $L(x)$. $L(x)$ is called a clade of $T$. For consistency, we consider $\emptyset$ to be a
clade of $T$ as well, and call it the empty clade. $S$, $\emptyset$ and all singletons are called
trivial clades. We denote by $triv(S)$ the collection of all trivial clades. Two sets
are said to be compatible if they are either disjoint, or one of them contains the
other.

Observation 1. A collection $\mathcal{S}$ of subsets of a set $S$ is the set of clades of some
tree over $S$ iff $\mathcal{S}$ contains $triv(S)$ and its subsets are pairwise compatible.
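
As an illustration of Observation 1, the following minimal sketch tests whether a collection of sets is the clade set of some tree. It assumes species are hashable labels and sets are given as Python sets; the function names `compatible` and `is_clade_set` are hypothetical.

```python
from itertools import combinations


def compatible(x, y):
    """Two sets are compatible if they are disjoint or one contains the other."""
    x, y = frozenset(x), frozenset(y)
    return x.isdisjoint(y) or x <= y or y <= x


def is_clade_set(collection, leaves):
    """Check the condition of Observation 1 for `collection` over `leaves`."""
    leaves = frozenset(leaves)
    sets = {frozenset(s) for s in collection}
    # triv(S): the empty set, S itself, and all singletons.
    trivial = {frozenset(), leaves} | {frozenset([s]) for s in leaves}
    return (all(s <= leaves for s in sets)
            and trivial <= sets
            and all(compatible(a, b) for a, b in combinations(sets, 2)))
```

The pairwise test is quadratic in the number of sets, which is adequate for small examples but is not meant to reflect the complexity bounds discussed in this paper.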

A tree is uniquely characterized by the set of its clades. The transformation
between a branch-node representation of a tree and a list of its clades is straightforward.
Hereafter, we identify a tree $T$ with the set $\{S' : S' \text{ is a clade of } T\}$.
Let $\hat{S}$ be a subset of the leaves of $T$. Then the subtree of $T$ induced on $\hat{S}$ is the
collection $\{\hat{S} \cap S' : S' \in T\}$ (which defines a tree).

Throughout the paper we denote by $S = \{s_1, \ldots, s_n\}$ the set of all species
and by $C = \{c_1, \ldots, c_m\}$ the set of all (binary) characters. For a graph $K$, we
define $C(K) = C \cap V(K)$ and $S(K) = S \cap V(K)$. Let $B_{n \times m}$ be a binary matrix
whose rows correspond to species, each row being the character-vector of the
corresponding species. That is, $b_{ij} = 1$ iff the species $s_i$ has the character $c_j$.
A phylogenetic tree for $B$ is a rooted tree $T$ with $n$ leaves corresponding to the
$n$ species of $S$, such that each character $c_j$ is associated with a clade $S'$ of $T$,
satisfying:

(1) $s_i \in S'$ iff $b_{ij} = 1$.

(2) Every non-trivial clade of T is associated with at least one character. 

For a character $c$, the node $x$ of $T$, whose clade $L(x)$ is associated with $c$, is
called the origin of $c$ w.r.t. $T$.

Let $A_{n \times m}$ be a $\{0, 1, ?\}$ matrix in which $a_{ij} = 1$ if $s_i$ has $c_j$, $a_{ij} = 0$ if $s_i$
lacks $c_j$, and $a_{ij} = ?$ if it is not known whether $s_i$ has $c_j$. $A$ is called incomplete
if it contains at least one ’?’. For a character $c_j$ and a value $x \in \{0, ?, 1\}$, the
$x$-set of $c_j$ in $A$ is the set of species $c_j(A, x) = \{s_i : a_{ij} = x\}$. $c_j$ is called a null
character if its 1-set is empty.
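
To make these definitions concrete, here is a minimal sketch under an assumed encoding: the matrix $A$ is a list of rows whose entries are the integers 0 and 1 or the string `'?'`, and species and characters are identified by their 0-based indices. The helper names are hypothetical.

```python
def x_set(A, j, x):
    """The x-set of character j in A: all species i with A[i][j] == x."""
    return {i for i, row in enumerate(A) if row[j] == x}


def is_null(A, j):
    """A character is null if its 1-set is empty."""
    return not x_set(A, j, 1)


def is_incomplete(A):
    """A is incomplete if it contains at least one '?' entry."""
    return any(entry == '?' for row in A for entry in row)
```

For the matrix `[[1, '?', 0], [0, 1, 0]]`, the ?-set of character 1 is `{0}` and character 2 is a null character.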

A binary matrix $B$ is called a completion of $A$ if $a_{ij} \in \{b_{ij}, ?\}$ for all $i, j$. Thus,
a completion replaces all the ?-s in $A$ by zeroes and ones. If $B$ has a phylogenetic
tree $T$, we say that $T$ is a phylogenetic tree for $A$ as well. We also say that
$T$ explains $A$ via $B$, and that $A$ is explainable. An example of an incomplete
matrix $A$, a completion of $A$, and a phylogenetic tree which explains $A$, is given
in Figure 1.

The following lemma, closely related to Observation 1, has been proven independently
by several authors:

Lemma 1. A binary matrix B has a phylogenetic tree iff the 1-sets of every two 
characters are compatible. 
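
Lemma 1 translates directly into a simple test. The sketch below is a naive illustration only, assuming $B$ is a list of rows over 0 and 1. It compares all pairs of columns and therefore takes $O(nm^2)$ time, far from the linear-time bound mentioned in the introduction; the names `one_sets` and `has_phylogenetic_tree` are hypothetical.

```python
from itertools import combinations


def one_sets(B):
    """Return the 1-set of every character (column) of the binary matrix B."""
    n = len(B)
    m = len(B[0]) if B else 0
    return [frozenset(i for i in range(n) if B[i][j] == 1) for j in range(m)]


def has_phylogenetic_tree(B):
    """Lemma 1: B has a phylogenetic tree iff all 1-sets are pairwise compatible."""
    def compatible(x, y):
        return x.isdisjoint(y) or x <= y or y <= x
    return all(compatible(x, y) for x, y in combinations(one_sets(B), 2))
```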

An analogous lemma holds for undirected characters (cf. (HI)- In contrast, 
for incomplete matrices, even if every pair of columns has a phylogenetic tree, 
the full matrix might not have one. Such an example was provided by ( 7 ) for 
undirected characters. We provide a simpler example for directed characters in
Figure 2.
Fig. 1. Left to right: An incomplete matrix A, a completion B of A, and a phylogenetic
tree that explains A via B. A character x to the right of a vertex v means that v is the 
origin of x. 
Fig. 2. An incomplete matrix which has no phylogenetic tree although every pair of
its columns has one. 



We are now ready to state the IDP problem: 

Incomplete Directed Perfect Phylogeny (IDP): 

Instance: An incomplete matrix A. 

Goal: Find a tree which explains A, or determine that no such tree exists. 

3 Characterizations of Explainable Binary Matrices 

3.1 Forbidden Subgraph Characterization 

Let $B$ be a species-characters binary matrix of order $n \times m$. Construct the bipartite
graph $G(B) = (S, C, E)$ with $E = \{(s_i, c_j) : b_{ij} = 1\}$. For a subset $S' \subseteq S$ of
species, we say that a character $c$ is $S'$-universal in $B$, if $S' \subseteq c(B, 1)$.

An induced path of length four in $G(B)$ is called a $\Sigma$ subgraph if it starts
(and therefore ends) at a vertex corresponding to a species (see Figure 3). A
graph with no induced $\Sigma$ subgraph is said to be $\Sigma$-free.

Proposition 1. If $G(B)$ is connected and $\Sigma$-free, then there exists a character
which is $S$-universal in $B$.

Proof. Suppose to the contrary that $G(B)$ has no $S$-universal character. Consider
the collection of all 1-sets of characters in $B$. Let $c$ be a character whose 1-set
is maximal w.r.t. inclusion in this collection. Let $s''$ be a species which lacks $c$.
Fig. 3. The $\Sigma$ subgraph.



Since $c$ has a non-empty 1-set and $G(B)$ is connected, there exists a path from
$s''$ to $c$ in $G(B)$. Consider the shortest such path $P$. Since $G(B)$ is bipartite, the
length of $P$ is odd. $P$ cannot be of length 1, by the choice of $s''$. Furthermore,
if $P$ is of length greater than 3, then its first 5 vertices induce a $\Sigma$ subgraph, a
contradiction. Thus $P = (s'', c', s', c)$ must be of length 3. By maximality of the
1-set of $c$, it is not contained in the 1-set of $c'$ (the containment would be proper,
since $s''$ has $c'$ but lacks $c$). Hence, there exists a species $s$
which has the character $c$ but lacks $c'$. Together with the vertices of $P$, $s$ induces
a $\Sigma$ subgraph, as depicted in Figure 3, a contradiction.

The following theorem restates Lemma 1 in terms of graph theory.

Theorem 1. $B$ has a phylogenetic tree iff $G(B)$ is $\Sigma$-free.
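
The forbidden subgraph can be searched for directly. The following sketch is a brute-force illustration, not part of the paper's algorithm: it assumes $B$ is a list of 0/1 rows and looks for two characters and three species forming an induced path species, character, species, character, species. The name `find_sigma` is hypothetical.

```python
from itertools import combinations


def find_sigma(B):
    """Return (s1, c1, s2, c2, s3) inducing the forbidden path in G(B), or None.

    The path is s1 - c1 - s2 - c2 - s3: s1 has only c1, s2 has both
    characters, and s3 has only c2."""
    n = len(B)
    m = len(B[0]) if B else 0
    for c1, c2 in combinations(range(m), 2):
        only1 = [i for i in range(n) if B[i][c1] == 1 and B[i][c2] == 0]
        both = [i for i in range(n) if B[i][c1] == 1 and B[i][c2] == 1]
        only2 = [i for i in range(n) if B[i][c1] == 0 and B[i][c2] == 1]
        if only1 and both and only2:
            return (only1[0], c1, both[0], c2, only2[0])
    return None
```

A returned tuple is a witness that the 1-sets of `c1` and `c2` are incompatible, which is exactly the situation excluded by Lemma 1.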



Corollary 1. Let $\hat{A}$ be a submatrix of $A$. If $A$ is explainable, then so is $\hat{A}$.
Furthermore, if $T$ explains $A$, then $\hat{A}$ is explained by the subtree of $T$ induced
on its rows.

Let $\Pi$ be a graph property. In the $\Pi$ sandwich problem one is given a vertex
set $V$ and a partition of $\{(x, y) : x, y \in V\}$ into three subsets: $E_0$,
the forbidden edges, $E_1$, the mandatory edges, and $E_?$, the optional edges.
The objective is to find a supergraph of $(V, E_1)$ which satisfies $\Pi$ and contains
no forbidden edges. For the property of having no induced $\Sigma$ subgraphs, the
problem is formally defined as follows:

$\Sigma$-Free-Sandwich:

Instance: A vertex set $V$, and two disjoint edge sets $E_0$, $E_1$ over $V$.

Question: Is there a set $F$ of edges such that $F \supseteq E_1$, $F \cap E_0 = \emptyset$, and the
graph $(V, F)$ satisfies $\Pi$?

Proposition 2. $\Sigma$-free-sandwich is equivalent to IDP.

Hence, the required graph $(V, F)$ must be “sandwiched” between $(V, E_1)$ and
$(V, E_1 \cup E_?)$. The reader is referred to 1 1 Dfb] for a discussion of various sandwich
problems.

Theorem 1 motivates looking at the IDP problem with input $A$ as an instance
$((S, C), E_0^A, E_?^A, E_1^A)$ of the $\Sigma$-free sandwich problem. Here,
$E_x^A = \{(s_i, c_j) : a_{ij} = x\}$, for $x = 0, ?, 1$. In the sequel, we omit the superscript $A$ when it is clear
from the context.

Note that there is an obvious 1-1 correspondence between completions of $A$
and possible solutions of $((S, C), E_0, E_?, E_1)$. Hence, in the sequel we refer to
matrices and their corresponding sandwich instances interchangeably.
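
The translation from a matrix to its sandwich instance can be sketched as follows. The encoding is an assumption made for illustration: entries of $A$ are 0, 1 or `'?'`, an edge is a pair `(i, j)` of a species index and a character index, and the function names are hypothetical. The second function checks only the sandwich constraints on an edge set $F$, not the forbidden-subgraph property.

```python
def sandwich_instance(A):
    """Split all (species, character) pairs into E_0, E_? and E_1 by the entry of A."""
    edges = {0: set(), '?': set(), 1: set()}
    for i, row in enumerate(A):
        for j, entry in enumerate(row):
            edges[entry].add((i, j))
    return edges


def respects_sandwich(F, instance):
    """True if F contains every mandatory edge and no forbidden edge."""
    F = set(F)
    return instance[1] <= F and not (F & instance[0])
```

An edge set that passes `respects_sandwich` corresponds to a completion of $A$: the entry is 1 exactly on the chosen edges.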
3.2 Forbidden Submatrix Characterizations

A binary matrix $B$ is called canonical if it can be decomposed as follows:

(1) Its $k_0 \ge 0$ left columns are all zero.

(2) Its next $k_1 \ge 0$ columns are all one.

(3) There exist canonical matrices $B_1, \ldots, B_l$, such that the rest (0 or more) of
the columns of $B$ form the block-structure illustrated in Figure 4.

We say that a matrix $B$ avoids a matrix $X$, if no submatrix of $B$ is identical
to $X$.



Theorem 2. Let $B$ be a binary matrix. The following are equivalent:

1. $B$ has a phylogenetic tree.

2. $G(B)$ is $\Sigma$-free.

3. Every matrix obtained by permuting the rows and columns of $B$ avoids the
following matrix:

$$Z = \begin{pmatrix} 1 & 1 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}$$

4. There exists an ordering of the rows and columns of $B$ which yields a canonical
matrix.

5. There exists an ordering of the rows and columns of B so that the resulting 
matrix avoids the following matrices: 



Ai = 





1 1 
0 1 



,A4 



1 

0 

1 



Proof.

$1 \Leftrightarrow 2$: Theorem 1.

$2 \Leftrightarrow 3$: Trivial.

$1 \Rightarrow 4$: Suppose $T$ is a tree that explains $B$. Assign to each node of $T$ an index
which equals its position in a preorder visit of $T$. Sort all characters according
to the preorder indices of their origin nodes (letting null characters come
first). Sort all species according to the preorder indices of their corresponding
leaves in $T$. The result is a canonical matrix.

$4 \Rightarrow 5$: Easily verifiable from the definition of canonical matrices.

$5 \Rightarrow 3$: Suppose to the contrary that $B$ has an ordering of its rows and columns,
so that rows $i_1, i_2, i_3$ and columns $j_1, j_2$ of the resulting matrix compose the
submatrix $Z$. Consider the permutations $\theta_{row}, \theta_{col}$ of the rows and columns
of $B$, respectively, which yield a matrix avoiding $X_1, \ldots, X_4$. In this ordering,
row $\theta_{row}(i_1)$ necessarily lies between rows $\theta_{row}(i_2)$ and $\theta_{row}(i_3)$, or else,
the submatrix $X_4$ occurs. Suppose that $\theta_{row}(i_2) < \theta_{row}(i_3)$ and
$\theta_{col}(j_1) < \theta_{col}(j_2)$, then $X_3$ occurs, a contradiction. The remaining cases are similar.
Note that a matrix which avoids $X_4$ has the consecutive ones property in
columns. Gusfield im Theorem 3] has proven that a matrix which has a phyloge- 
netic tree can be reordered so as to satisfy that property. In fact, the reordering 
used by Gusfield’s proof generates an essentially canonical matrix. 
Fig. 4. Recursive definition of canonical matrices. Each $B_i$ is constructed in the same
manner. 



Klinz et al. m study and review numerous problems of permuting matrices 
to avoid forbidden submatrices. 

4 The Algorithm 

Let $A$ be the input matrix. Define $G_x(A) = (S, C, E_x^A)$ for $x = ?, 1$. For a subset
$\emptyset \neq S' \subseteq S$, we say that a character is $S'$-semi-universal in $A$ if its 0-set does
not intersect $S'$.

We now present an algorithm for solving IDP. The algorithm outputs a tree
$T$ which explains $A$, or outputs False if no such tree exists.

1. $G \leftarrow G_1(A)$, $T \leftarrow triv(S)$.

2. Remove all $S$-semi-universal and all null characters from $G$.

3. While $E(G) \neq \emptyset$ do:

(a) For each connected component $K$ of $G$ such that $|E(K)| \ge 1$ do:

i. Let $S' \leftarrow S(K)$.

ii. Compute the set $U$ of all characters in $K$ which are $S'$-semi-universal
in $A$.

iii. If $U = \emptyset$ then output False and halt.

iv. Otherwise, remove $U$ from $G$ and update $T \leftarrow T \cup \{S'\}$.

4. Output $T$.
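
The steps above can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the efficient implementation analysed in Theorem 4: the matrix is a list of rows with entries 0, 1 or `'?'`, species and characters are 0-based indices, the graph $G$ is represented implicitly by the set of characters not yet removed, connected components are recomputed from scratch in every iteration, and `None` stands for the output False. The names `components` and `solve_idp` are hypothetical.

```python
def components(alive, ones):
    """Connected components of G that contain at least one edge.

    `alive` is the set of characters still in G and `ones[c]` is the 1-set
    of character c. Returns a list of (species, characters) pairs."""
    unvisited = set(alive)
    result = []
    while unvisited:
        start = unvisited.pop()
        chars, species, stack = {start}, set(), [start]
        while stack:
            c = stack.pop()
            new_species = ones[c] - species
            species |= new_species
            # Every unvisited character adjacent to a newly reached species joins K.
            for d in [d for d in unvisited if ones[d] & new_species]:
                unvisited.discard(d)
                chars.add(d)
                stack.append(d)
        result.append((frozenset(species), chars))
    return result


def solve_idp(A):
    """Return the clade set T of a tree explaining A, or None if none exists."""
    n = len(A)
    m = len(A[0]) if A else 0
    all_species = frozenset(range(n))
    ones = [frozenset(i for i in range(n) if A[i][j] == 1) for j in range(m)]
    zeros = [frozenset(i for i in range(n) if A[i][j] == 0) for j in range(m)]
    # Step 1: T <- triv(S); G = G_1(A) is kept implicitly via `alive` and `ones`.
    T = {frozenset(), all_species} | {frozenset([s]) for s in all_species}
    # Step 2: drop null characters (empty 1-set) and S-semi-universal ones (empty 0-set).
    alive = {j for j in range(m) if ones[j] and zeros[j]}
    # Step 3: every remaining character has an edge, so E(G) is empty iff `alive` is.
    while alive:
        for species, chars in components(alive, ones):
            # Step 3(a)ii: characters whose 0-set misses S(K).
            U = {c for c in chars if not (zeros[c] & species)}
            if not U:
                return None  # Step 3(a)iii
            alive -= U       # Step 3(a)iv
            T.add(species)
    return T
```

For example, on the rows `[1,1,0]`, `[1,1,0]`, `[1,0,0]`, `[0,0,1]`, `[0,'?',1]` the sketch adds the non-trivial clades $\{0,1,2\}$ and $\{3,4\}$ in the first iteration and $\{0,1\}$ in the second, while on the complete matrix $Z$ of Theorem 2 it reports that no tree exists.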

Theorem 3. The above algorithm correctly solves IDP.

Proof. Suppose that the algorithm returns False. Then there exists an iteration
of the ’while’ loop at which some connected component $K$ contained no
$S(K)$-semi-universal character. Suppose to the contrary that some $F^*$ solves $A$, and let
$G^* = (S, C, F^*)$. Let $H^*$ be the subgraph of $G^*$ induced by $V(K)$. By definition,
$H^*$ is connected and by Theorem 1 it is also $\Sigma$-free. Therefore, by Proposition 1
it has an $S(K)$-universal character. That character must be $S(K)$-semi-universal
in $A$, a contradiction.

On the other hand, suppose that the algorithm returns a collection $T$ of
sets. We shall prove that $T$ is a tree which explains $A$. We first prove that the
collection $T$ of sets is pairwise compatible, implying, by Observation 1, that $T$
is a tree. Let $S_1, S_2$ be two subsets in $T$. Let $t_i$ denote the iteration of the ’while’
loop at which $S_i$ was added to $T$, for $i = 1, 2$. If $t_1 = t_2$ then $S_1$ and $S_2$ are
clearly disjoint. Otherwise, suppose w.l.o.g. that $t_1 < t_2$ and $S_1 \cap S_2 \neq \emptyset$. Let $K_i$
denote the connected component containing $S_i$ at iteration $t_i$ of the algorithm.
The edge set of $G$ at iteration $t_1$ contains the one at iteration $t_2$. Therefore, $K_1$
contains the vertices in $S_2$. It follows that $S_1 \supseteq S_2$.

It remains to show that the resulting tree is a phylogenetic tree for $A$. Suppose
that a character $c$ was removed from $G$ as a part of some set $U$ in Step 3(a)iv.
Associate with $c$ the clade $S'$, which was added to $T$ at that same step. Observe
that each non-trivial clade $S' \in T$ is associated with at least one character.
Associate with each $S$-semi-universal character the clade $S$. Associate with each
null character the empty clade. Finally, define a binary matrix $B_{n \times m}$ with $b_{sc} = 1$
iff $s$ belongs to the clade $S_c$ associated with $c$. Since $a_{sc} \neq 1$ for all $s \notin S_c$ and
$a_{sc} \neq 0$ for all $s \in S_c$, $B$ is a completion of $A$. The claim follows.
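
The construction of $B$ in the last step of the proof can be sketched as follows. The encoding is assumed for illustration: `clade_of[j]` is the set of species indices of the clade associated with character $j$, entries of $A$ are 0, 1 or `'?'`, and the name `completion_from_clades` is hypothetical. The function also verifies that the result agrees with every known entry of $A$.

```python
def completion_from_clades(A, clade_of):
    """Build B with b[s][c] = 1 iff s lies in the clade associated with c.

    Returns B if it is a completion of A (it agrees with all known
    entries), and None otherwise."""
    n, m = len(A), len(clade_of)
    B = [[1 if s in clade_of[c] else 0 for c in range(m)] for s in range(n)]
    agrees = all(A[s][c] in (B[s][c], '?') for s in range(n) for c in range(m))
    return B if agrees else None
```

With `A = [[1, 1], [1, '?'], [0, 0]]` and both characters associated with the clade `{0, 1}`, the unknown entry is completed to 1.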



The algorithm can be naively implemented in $O(hnm)$ time, where
$h \le \min\{m, n\}$ denotes the height of the reconstructed tree. This can be seen by
noting that each iteration of the ’while’ loop increases the height of the output
tree by one. We give a better bound in the following theorem.

Theorem 4. The complexity of the algorithm is $O(nm + |E_1| \log^2(n + m))$ deterministic
time. Alternatively, there exists a randomized algorithm that solves IDP
in $O(nm + |E_1| \log(l^2/|E_1|) + l(\log l)^3 \log\log l)$ expected time, where $l = n + m$.



Proof. Each iteration of the ’while’ loop splits the (potential) clades added in
the previous one. Thus, the algorithm performs an iteration per level of the
tree returned, and at most $h$ iterations. The connected components of $G$ can
be initialized in $O(|E_1| + n + m)$ time, and maintained using a dynamic data
structure for graph connectivity. Using the dynamic algorithm of m the connected
components of $G$ can be maintained during $|E_1|$ edge deletions at a cost of
$O(|E_1| \log^2(n + m))$ time spent in Step 3(a)iv. Alternatively, using the Las Vegas
type randomized algorithm for decremental dynamic connectivity m, the edge
deletions can be supported in $O(|E_1| \log(l^2/|E_1|) + l(\log l)^3 \log\log l)$ expected
time.

The connected components of $G$ must be explicitly recomputed from the
dynamic data structure for each iteration. This takes $O(h(m + n)) = O(nm)$
time in total. Since each set added to $T$ in Step 3(a)iv corresponds to at least
one character, and each character is associated with exactly one set, updating
$T$ requires in total $O(nm)$ time.

It remains to show how semi-universal characters can be efficiently found in
Step 3(a)ii. Let $G(t)$ denote the graph $G$ at iteration $t$ of the ’while’ loop. For
a character $c$, denote by $d_c^1(t)$ its degree in $G(t)$, and by $d_c^?(t)$ the degree of $c$
in $G_?(A)$ to its connected component $K_c$ in $G(t)$. Given $d_c^1(t)$, $d_c^?(t)$, one can
check in $O(1)$ time whether $c$ is $S(K_c)$-semi-universal. $d_c^1(t)$ remains unchanged
throughout, and equals the size of the 1-set of $c$. $d_c^?(t)$ can be maintained as follows.
$d_c^?(1)$ is initialized in $O(|E_?|)$ time (given the connected components of $G(1)$). At the beginning
of iteration $t$, $d_c^?(t+1)$ is initialized to $d_c^?(t)$. Each time a connected component
$K_c$ of $G(t)$ is split into sub-components $K_1, \ldots, K_l$ due to the removal of characters
in Step 3(a)iv, we update $d_c^?(t+1)$ as follows: For $c \in C(K_j)$, we decrease
$d_c^?(t+1)$ by $|\{(s, c) \in E_? : s \in S(K_p), p \neq j\}|$. This takes $O(|E_?|)$ time for all
$c$, $t$. Finally, finding the semi-universal characters over all iterations costs $O(hm)$
time. The complexity follows.
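
The constant-time test is presumably a comparison of counts: the 0-set of $c$ misses $S(K_c)$ exactly when every species of the component is joined to $c$ by a 1-edge or a ?-edge, that is, when the two degrees sum to $|S(K_c)|$. The sketch below illustrates only this criterion. It recomputes the two degrees from the matrix instead of maintaining them incrementally as in the proof, so it does not achieve the stated bound; the matrix encoding (entries 0, 1 or `'?'`) and the function name are assumptions.

```python
def semi_universal_by_counts(A, component_species, c):
    """Test whether character c is semi-universal for the given species set.

    d1 counts 1-entries and dq counts '?'-entries of c inside the component;
    c is semi-universal iff no species of the component has a 0 for c."""
    d1 = sum(1 for s in component_species if A[s][c] == 1)
    dq = sum(1 for s in component_species if A[s][c] == '?')
    return d1 + dq == len(component_species)
```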

We remark that an $\Omega(mn)$ time lower bound for (undirected) binary perfect
phylogeny was proven by Gusfield m- A closer look at Gusfield’s proof reveals 
that it applies, as is, also to the directed case. As IDP generalizes directed binary 
perfect phylogeny, any algorithm for this problem would require $\Omega(mn)$ time.

5 Concluding Remarks 

We have given a polynomial algorithm for reconstructing a phylogeny from incomplete
binary directed data, using a graph theoretic reformulation of the problem.
The algorithm is near optimal and takes $\tilde{O}(nm)$ time.

An interesting question regarding IDP is whether one can identify if there 
exists a “most general” solution, so that all others are obtained from it by 
refinements (additional clades). We have proven that in case a “most general” 
solution exists, the algorithm described here provides that solution. The full 
version of this manuscript will include this proof, along with a more general 
polynomial time algorithm, that also determines if such a solution exists. 